1. Introduction

There  are  three  major  components  to  our  project.  The  first  is  in  the  area  of  procedural 
design  of  VLSI  circuits.  The  second  is  our  census  language  and  techniques,  and  the  third  is 
in  the  area  of  the  testing  of  VLSI  circuits. 

2.  Procedural  Approach  to  VLSI  Design 

2.1.  ALLENDE  [LaPaugh,  Mata,  Heng,  Lin,  Yeh] 

ALLENDE  is  a  new  language  for  VLSI  design  based  on  our  earlier  work  on  ALI  and 
Clay.  Several  new  ideas  have  been  introduced  to  make  it  both  easier  to  use  and  more 
efficient. First, a layout is constructed in a structured way. Second, wires which were
explicit in our earlier languages are now implicit. This eliminates the need for tedious naming of
wires  and  resulting  errors.  Additionally,  the  structured  nature  of  ALLENDE  forces  all 
design  rule  violations  to  be  caught  by  the  system;  hence,  one  need  not  use  a  standard  design 
rule  checker. 

Internally, ALLENDE generates not CIF but a higher level form which we call PIF.
PIF  also  is  a  useful  tool  for  interfacing  other  design  tools.  We  are  currently  using  both 
ALLENDE  and  PIF  to  build  a  variety  of  other  tools.  Lin  has  written  a  channel  router  based 
on the Rivest-Fiduccia algorithm. Heng has written a pad router; he is now making it work
with  the  MOSIS  pad  frames.  The  Berkeley  PLA  generator  has  been  interfaced  to 
ALLENDE; we are now building a Weinberger array generator.

2.2.  Clay  [North] 

The  Clay  procedural  layout  system  is  the  primary  design  tool  for  the  Princeton 
Reduced Instruction Set Machine (PRISM) project. The control path bitslice has been
fabricated and is in test. The data path chip will be submitted to MOSIS for fabrication shortly.
These  projects  have  given  us  experience  with  Clay  in  creating  large  VLSI  layouts.  The  Clay 
system  has  helped  to  shape  the  ALLENDE  design  language,  and  has  also  given  us  insight 
into  desirable  characteristics  for  CAD  tools  combining  both  procedural  code  and  graphics 
and  efficient  techniques  for  implementing  procedural  layout  systems. 

2.3.  Applications  [Steiglitz] 

ALLENDE  is  being  used  in  the  design  phase  of  a  project  to  study  cellular  automata. 
The project is examining the capabilities of cellular automata as a model of general
nonlinear phenomena. The implementation of cellular automata as VLSI chips will allow
experimentation that is too time-consuming or expensive using general purpose computers.
Currently, a multiprocessor cellular automaton chip with programmable next-state function
is  being  designed.  We  have  already  fabricated  and  tested  an  18  processor  cellular  automaton 
with  a  fixed  next-state  function;  it  achieves  about  1.4X10®  bit  updates  per  second. 
3. Census

3.1. Top/Down Project [Lopresti]

This  project  is  investigating  the  use  of  the  census  approach  to  parallel  computations. 
We  currently  have  a  four  processor  system  running;  recently  four  new  processor  boards 
arrived. The new boards are based on the 32-bit National chip set and include floating point
and  half  a  megabyte  of  memory.  We  are  continuing  our  experiments  on  uses  of  the  system, 
focusing  on  a  variety  of  local  search  and  “simulated  annealing”  type  tasks. 

3.2.  ESP  [Park] 

A  prototype  version  of  the  ESP  controller  hardware  for  use  in  the  MMM  Project  is  now 
undergoing  testing.  The  individual  ESP  controllers  are  designed  with  TTL  hardware  and 
are currently on a multibus wire wrap board. These ESP controllers are being used to
interconnect two Intel 8086 microprocessors via a twenty bit wide ESP bus. The
completed system will have a one megabyte memory space distributed across up to eight
processors. We plan next to use the system as a test bed for a variety of issues such as
synchronization, reliability, and global control.

4. Testing [LaPaugh, Steiglitz, Lucas]

We are continuing to explore ways to use design modification to simplify testing. We
are  currently  empirically  studying  a  variety  of  methods  of  using  additional  logic  at  the  gate 
level  to  enhance  testability.  The  key  questions  are  the  tradeoffs  between  additional  logic 
and ease of test vector generation.

5. Recent Ph.D. Theses

M.  D.  Huang,  “Localized  Graph  Algorithms  with  Low  Page-Fault  Complexity". 

J. M. Mata, “A Methodology for VLSI design and a Constraint-Based Layout Language”.

A.  S.  Yergis,  “Multiple  Fault  Detection  in  Digital  Circuits”. 
Abstract

ALLENDE  is  a  simple  and  powerful  layout  language,  associated  with  a 
structured  design  methodology  for  VLSI.  It  has  a  combination  of  features  that 
set it apart from the existing VLSI layout tools. These features include the
procedural language approach, the structured specification of the layout, the use of
constraints  to  represent  the  layout,  and  the  use  of  an  intermediate  form  in  the 
implementation  of  the  system. 

In  ALLENDE  the  layout  is  described  hierarchically  as  a  composition  of 
cells;  absolute  sizes  or  positions  are  never  specified.  The  layout  description  is 
translated into linear constraints, which express design rules and relative
position of the layout elements. By solving these constraints we obtain the absolute
layout,  which  is  guaranteed  to  be  free  of  design  rule  violations.  Errors  in  the 
layout  description  are  immediately  detected  and  easily  located. 

ALLENDE consists of five procedures to be called from a Pascal or C
program, allowing the user to describe a VLSI layout. Layout elements can be
extensively parameterized, and the user can draw on the
full power of Pascal or C. The ALLENDE layout system has been implemented for
the nMOS technology. In this system we can also use cells generated by other
layout tools. Our layout language can also be a target for a silicon compiler.

1.  Introduction 

The costs associated with the design of complex chips, the need to
make  integrated  circuit  design  accessible  to  a  larger  number  of  people, 
and  the  need  for  more  powerful  tools  to  manage  design  changes,  have 
forced  the  reevaluation  of  VLSI  design  techniques.  Methods  to  enhance 
designer  productivity  have  to  be  explored. 

The layout phase is the most critical phase in the design of
integrated circuits, because it involves expensive tools and a large
amount of human intervention, and also because of its effects on
production costs. A large part of the work in VLSI is dedicated to layout
tools and techniques. The majority of the layout tools are graphics
editors, like STICKS[16] and CAESAR[12]. There also exist layout languages,
procedural or only descriptive, like LAVA[10], PLATES[14], HILL[b], and
ALI[6][15]. Layout languages have had limited success, mainly because of
being too verbose, limited in power and flexibility, and giving poor
results. One of the goals of layout tools is to produce error-free layouts,
but still today there is a proliferation of layout verification programs, like
design rule checkers.

This work was supported in part by NSF Grant MCS-8004490, DARPA Contract
N00014-82-K-0549, ONR Grant N00014-83-K-0275, and CAPES-Brazil.

In  this  work  we  concentrate  on  the  layout  problem.  We  have  two 
major goals in mind:

—  to  have  a  powerful  tool  that  allows  the  designer  to  obtain  easily  a 
correct  layout  for  his  design; 

—  to  have  a  component  that  can  be  extended  and  integrated  with 
other components of a design system associated with a structured
design  methodology. 

Our  approach  for  VLSI  layout  is  basically  to  use  a  language  to 
hierarchically  describe  the  circuit  structure  and  layout  topology,  and  to 
use  linear  constraints  to  internally  represent  the  layout. 

Our layout system, ALLENDE[8][9], has a combination of features
that set it apart from the existing layout tools. The procedural language
approach and the representation of the layout as constraints distinguish
ALLENDE from all graphics-based layout editors and from most layout
languages. The difference from other existing or proposed constraint-based
procedural layout languages lies in the way the layout is
described, the kind of constraints generated, and also the form of
implementation of the system. The net result is the power, flexibility, and
efficiency achieved by ALLENDE.

2.  The  ALLENDE  Layout  System 

2.1.  Basic  Ideas 

Our approach to the layout problem is to have a language for
the  description  of  the  layout  structure.  From  the  textual  representation 
of  the  layout,  constraints  are  generated,  solved,  and  then  the  physical 
layout  is  obtained.  The  characteristics  of  the  language  depend,  of  course, 
on  the  class  of  objects  manipulated  and  how  they  are  manipulated. 

In our system, the only objects that we have are cells. A cell
corresponds  to  a  rectangle  with  internal  structure  and  parameter  wires. 


Fig. 1 - A nand cell

A composition of two cells is made by specifying their relative
position (left, right, above, below), and the result is another cell. A single
cell  can  also  be  rotated  or  flipped. 


Fig.  2  -  Composition  of  cells 

When  composing  cells  there  is  no  need  to  worry  about  matching  cell 
size  or  wire  spacing  (except  for  cells  of  fixed  size);  the  conditions  for 
matching  are  expressed  in  the  linear  constraints  generated,  and  then  the 
size  of  a  cell  will  depend  on  the  context  in  which  it  is  instantiated. 

The  first  basic  idea  is  to  describe  the  layout  hierarchically  as  a 
composition  of  cells.  At  the  bottom  level  of  the  hierarchy  there  will  be 
system  cells  (contact,  transistor,  etc.)  or  rigid  cells  (previously  defined 
layout  pieces). 
(A above (B left C)) left (rotated D)


Fig.  3  -  Structured  layout  description 

When  using  a  system  cell  we  specify  the  wires  on  each  side  of  the 
cell.  When  two  cells  are  composed,  the  wires  to  be  connected,  and  the 
parameter  wires  for  the  resulting  cell,  are  determined  by  context. 

The  second  basic  idea  is  to  use  an  intermediate  language  to 
represent  the  layout  structure.  This  language  should  be  different  from 
the  user  language,  but  at  a  higher  level  than  a  mask  level  language,  like 
CIF (Caltech Intermediate Form) [1]. The intent is to separate language
aspects from layout aspects, or user aspects from system aspects. By
layout or system aspects we mean constraint generation and layout
production. By language or user aspects we mean the high level language
used to describe the layout, and its implementation.

This  intermediate  language  brings  flexibility  to  the  layout  system. 
There  may  be  more  than  one  user  language,  even  a  graphics  language. 
The implementations of the intermediate language and of the user
languages are independent, and each is easier than an implementation without
an intermediate language. The intermediate language deals with the layout
structure only, while the user language may have all the power of a
procedural language. The idea then is to extract all the layout
information from the user program, and then process it.


Fig. 4 - The role of the intermediate language

Based on these ideas we built a layout system. There is a user
language, ALLENDE, that is no more than Pascal or C with a few
procedures and functions added, and an intermediate language, PIF. The
output of the user program is the layout structure in intermediate form
(PIF), from which linear constraints are generated and solved, and the
absolute layout in CIF is produced.
Fig. 5 - The layout generation process

This  layout  system  works  for  the  nMOS  technology,  and  extensions 
for other technologies are under study. We use CIF to describe the final
layout,  although  other  languages  could  be  used;  in  the  same  way,  Pascal 
and  C  were  chosen  just  for  reasons  of  convenience.  We  use  only  right- 
angle  geometry.  The  coordinate  system  is  a  half-lambda  grid. 
2.2. Describing Layouts Using Linear Constraints

As  shown  in  fig.  5,  in  our  layout  system  there  are  different 
representations  for  the  layout:  the  user  representation  (in  ALLENDE),  the 
intermediate  form  (in  PIF),  the  symbolic  form  with  constraints,  and  the 
mask-level representation (in CIF). In CIF the layout objects, mainly
rectangles, are described in terms of absolute coordinates; the coordinate
unit  is  one  hundredth  of  a  micron. 

Our  symbolic  representation  of  the  layout  is  in  terms  of  the  relative 
coordinates of the layout elements; the relation between these
coordinates is expressed by a set of linear constraints. The variables in these
constraints are the X and Y coordinates of the objects in the layout. The
constraints describe the interaction between the objects, and may come
from the geometric design rules, connectivity, and hierarchy in the layout
description. By solving the constraints and replacing the values
obtained for the coordinates in the symbolic layout we obtain the
absolute layout.

The set of constraints is solved so as to minimize the total
area. Due to the large number of layout elements, the constraints ought
to be as simple as possible, in order to reduce the complexity of the
solving algorithm. We assume that the X and Y constraints are decoupled;
this means that no constraint involves both X and Y coordinates, and that
the constraints on X coordinates are independent of those on Y coordinates. We don't
allow constraints to be related by the operator or, for example. By
decoupling the X and Y constraints the compaction problem is made equivalent
to solving two independent sets of constraints.

The whole layout is represented using constraints of the form:

$x_i = x_j$

$x_i - x_j \ge d$ ($d > 0$, integer)

$x_i - x_j = e$ ($e > 0$, integer)

We have an efficient algorithm to solve such constraints, described
in [7]. The algorithm is based on topological sort.
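The solver of [7] is not reproduced here. As a rough illustration only, the following C sketch shows one common way a topological sort can solve inequality constraints of the form $x_i - x_j \ge d$: variables are visited in topological order of the constraint graph and each is pushed to the smallest coordinate allowed by its predecessors. The names `struct constraint`, `solve_constraints`, and `MAX_VARS` are hypothetical, and equality constraints are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VARS 64   /* illustrative limit, not from the paper */

/* One constraint x[to] - x[from] >= d: variable "to" must lie at
 * least d grid units beyond variable "from". */
struct constraint { int from, to, d; };

/* Assign the smallest non-negative coordinates satisfying every
 * constraint. Returns 0 on success, -1 if the constraint graph has a
 * cycle (no topological order exists) or there are too many variables. */
int solve_constraints(int nvars, const struct constraint *cons,
                      int ncons, int *x)
{
    int indeg[MAX_VARS] = {0};
    int queue[MAX_VARS];
    int head = 0, tail = 0, done = 0;
    int i, k;

    assert(x != NULL);
    if (nvars < 0 || nvars > MAX_VARS) return -1;

    for (i = 0; i < nvars; i++) x[i] = 0;
    for (k = 0; k < ncons; k++) indeg[cons[k].to]++;
    for (i = 0; i < nvars; i++)
        if (indeg[i] == 0) queue[tail++] = i;   /* unconstrained sources */

    while (head < tail) {
        int u = queue[head++];
        done++;
        /* Relax every constraint leaving u (linear scan keeps it short). */
        for (k = 0; k < ncons; k++) {
            int v;
            if (cons[k].from != u) continue;
            v = cons[k].to;
            if (x[u] + cons[k].d > x[v]) x[v] = x[u] + cons[k].d;
            if (--indeg[v] == 0) queue[tail++] = v;
        }
    }
    return done == nvars ? 0 : -1;
}
```

Calling `solve_constraints` once on the X constraints and once on the Y constraints reflects the decoupling assumption stated earlier; the linear scan over all constraints is chosen for brevity, not efficiency.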

Each constraint of the form $x_i - x_j = e$ corresponds to a rigid cell.
The user may control the number of constraints by constructing the layout
in several steps: making rigid cells and using them at the next level of
the cell hierarchy. If there are no constraints of the form $x_i - x_j = e$ in
the set of constraints generated, there is always a solution to the
equations, since our way of generating constraints doesn't create "cycles".
The only situation when there is no solution to the set of constraints
occurs when a rigid cell doesn't fit the context where it is instantiated.
For example, some condition may force a larger separation between two
parameter wires of a rigid cell.


2.3.  The  Intermediate  Language  PIF 

The idea behind PIF is to represent the layout structure in a compact
way, as in fig. 3. The objects in the layout are cells, and the operators
specify position or orientation. Our layout representation is exactly
like  an  arithmetic  expression;  operands  are  cells,  binary  operators 
specify  relative  position  (left,  right,  above,  below),  and  unary  operators 
specify  orientation  (rotation  or  flipping).  Operator  precedence  is  as  usual, 
and  parentheses  can  be  used  to  change  precedence. 

A  layout  in  PIF  is  a  structured  combination  of  cells,  while  a  layout 
in CIF is a combination of rectangles and other elements in any order.
The result of the interpretation of a PIF program is a set of constraints,
the  layout  in  symbolic  form  (no  absolute  coordinates  assigned),  and  the 
circuit  (at  the  switch  level)  for  simulation. 

An example of a PIF program is given in fig. 6. The code "C [ . . d2 m3 ]"
represents a cell that is the contact of two wires: diffusion 2 lambda
wide coming from the right and metal 3 lambda wide coming from the
bottom (the two "."s indicate that no wire comes from the left or top). "A"
means above, "+" means crossing, the parentheses delimit a cell, and
"$example" specifies the name example for the cell.


$example
(
    C [ . . d2 m3 ]
    A
    + [ p2 m3 p2 m3 ]
)

Fig. 6 - A PIF program

In  PIF,  layout  construction  is  like  expression  evaluation.  If  we  use  a 
grammar  to  describe  this  layout  language,  the  construction  of  the  layout 
can be done while parsing, in a bottom-up fashion. In fact, this is similar
to  the  way  the  system  for  typesetting  mathematics  EQN  [4]  was  designed 
and  implemented.  In  EQN,  equations  are  pictured  as  a  set  of  "boxes", 
pieced  together  in  various  ways. 
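The box analogy can be made concrete with a small C sketch. It is a deliberate simplification, not the ALLENDE implementation: each cell is reduced to a fixed bounding box, whereas real ALLENDE cells stretch to fit their context through the constraints. The names `struct box`, `compose_left`, `compose_above`, and `rotate90` are hypothetical.

```c
#include <assert.h>

/* A cell reduced to its bounding box, in grid units (illustrative only). */
struct box { int w, h; };

static int max_int(int a, int b) { return a > b ? a : b; }

/* "a left b": a sits to the left of b, so widths add and the result
 * is as tall as the taller operand. */
struct box compose_left(struct box a, struct box b)
{
    struct box r;
    assert(a.w >= 0 && a.h >= 0 && b.w >= 0 && b.h >= 0);
    r.w = a.w + b.w;
    r.h = max_int(a.h, b.h);
    return r;
}

/* "a above b": a sits on top of b, so heights add and the result
 * is as wide as the wider operand. */
struct box compose_above(struct box a, struct box b)
{
    struct box r;
    assert(a.w >= 0 && a.h >= 0 && b.w >= 0 && b.h >= 0);
    r.w = max_int(a.w, b.w);
    r.h = a.h + b.h;
    return r;
}

/* Unary operator: a quarter-turn swaps width and height. */
struct box rotate90(struct box a)
{
    struct box r;
    r.w = a.h;
    r.h = a.w;
    return r;
}
```

An expression such as the one in fig. 3 then evaluates bottom-up as nested calls, for example `compose_left(compose_above(A, compose_left(B, C)), rotate90(D))`, which mirrors the order in which a parser would reduce it.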

Fig. 7 shows the grammar that describes the PIF language, exactly
in the same format submitted to the compiler generator YACC [3]. As a
PIF program is supposed to be generated by a program, and not by the
user, we tried to make the language compact and easy to process, not
worrying about readability, although one can read a PIF program.
            { $$ = syscell($1,$3,$4,$5,$6,0,0); }
        | cellcode INTEGER INTEGER '[' wires wires wires wires ']'
            { $$ = syscell($1,$5,$6,$7,$8,$2,$3); }

cellcode : SYSCELL
            { $$ = cellcode = $1; }

wires   : wire
        | '|' wirelist '|'
            { $$ = $2; }

wirelist : wire
        | wirelist wire
            { $$ = wirelist($1,$2); }

wire    : LAYER INTEGER
            { $$ = startwire($1,$2); }
        | '.'
            { $$ = startwire(NOLAYER,0); }


Fig. 7 - PIF grammar




The lexical elements are the following:

INTEGER:      an unsigned integer;
LAYER:        m, p, d (for metal, polysilicon, and diffusion);
SYSCELL:      C, X, T, I, P, +, W, J, E, N
              (for contact, independent contact, transistor, implanted
              transistor, pullup, crossing, line, jog, empty, and nullcell,
              respectively);
POSITION:     L, R, A, B (for left, right, above, and below);
ORIENTATION:  r0, r90, r180, r270, f0, f90, f45, f135
              (for rotation and flipping);
CELLNAME:     $cellname
RIGIDNAME:    arrigidname
WIRENAMES:    \wirenames\
COMMENT:      /comment/


The  smallest  object  that  we  handle  is  a  system  cell,  which 
corresponds  to  a  structure  built  according  to  the  design  rules  and  that 
forms  a  contact,  a  transistor,  and  so  on.  The  cells  contact,  transistor,  and 
pullup  are  the  usual  ones.  An  implanted  transistor  is  a  pullup  with  the 
gate  not  connected  directly  to  the  source.  An  independent  contact 
represents  the  connection  of  wires  of  the  same  layer  independently  of 
other  layers;  it  is  basically  used  to  represent  independent  overlapping 
wires  in  a  cell.  The  crossing  of  wires  in  the  layout  has  to  be  specified,  and 
the  cell  crossing  is  used  for  this  purpose.  A  jog  represents  a  bend  in  a 
wire,  that  can  move  in  two  directions.  The  cell  empty  represents  a  cell 
with  nothing  inside,  and  it  is  useful  for  top-down  design.  The  cell  nullcell 
has  no  effect;  it  is  like  an  identity  element  for  the  placement  operation, 
and it is useful to simplify some programs that describe a layout. The cell line
means a single wire or a set of parallel wires; it is used in situations like
the one shown in fig. 8.


A  left  (B  above  LINE) 


Fig. 8 - Use of the "line" system cell


A rigid cell is a cell of fixed size; its code is in CIF, with a header
giving information like size and parameter wires. rigidname is the name of
a file containing the rigid cell. cellname is just the name of a cell, used
mostly for debugging purposes.

Wire  names  are  related  to  a  cell,  and  they  refer  to  the  parameter 
wires  of  the  cell.  One  of  the  uses  of  wire  names  is  to  give  information  for 
simulation. 


2.4.  The  ALLENDE  Language 

ALLENDE (A Layout Language Effective for nMOS Design) is a set of
procedures and functions to be called from a Pascal or C program,
allowing the user to describe a VLSI layout. Basically, the user describes cells
and  their  relative  placement.  Cell  hierarchy  comes  naturally  by  using 
procedures  to  describe  cells. 

The  user  can  make  use  of  the  full  power  of  Pascal  or  C.  The  basic 
procedures  and  functions  to  describe  the  layout  allow  a  great  deal  of 
parameterization,  thus  allowing  the  user  to  obtain  completely  different 
layouts  just  by  changing  a  parameter  in  the  program. 

The  output  of  the  user  program  is  the  layout  in  intermediate  form 
(PIF), from which linear constraints are generated and solved, and the
absolute layout in CIF is produced. The layout obtained is guaranteed to be
free  of  design  rule  violations. 

The  basic  idea  of  the  ALLENDE  language  is  the  same  as  in  the  PIF 
intermediate form: the layout is described hierarchically as a
composition of cells. The difference now is that the user has available all the
power  of  a  procedural  language. 

The  following  procedures  allow  the  user  to  describe  a  layout: 

syscell(kind, wire1, wire2, wire3, wire4, ratio)
extcell(filename)
place(operator)
begincell(cellname)
endcell(wirenames)

syscell specifies a system cell; extcell specifies an external cell;
begincell and endcell are used to delimit a composite cell; place specifies
the operator to be applied for a cell composition.

Since what these procedures do is to generate some intermediate
code to be interpreted later on, Pascal (or C) commands can be
intermixed with calls to these procedures. The user can also define new
procedures in terms of these basic procedures.
Fig. 9 shows the ALLENDE program, in C, that generates the PIF
program of fig. 6. The generation of PIF code is straightforward: each call
to one of the five procedures listed previously causes the generation of
the corresponding code in PIF. For example, endcell(" ") generates only
the character ")".


#include "/va/allende/usr/def.h"

main()
{
    begincell("example");
    syscell(CONTACT, nowire, nowire, diff(2), metal(3), 0);
    place(ABOVE);
    syscell(CROSSING, poly(2), metal(3), poly(2), metal(3), 0);
    endcell(" ");
}


Fig. 9 - An ALLENDE program

The  procedure  syscell  allows  the  specification  of  a  system  cell. 
There  are  10  kinds  of  system  cells:  CONTACT,  ICONTACT,  TRANSISTOR, 
ITRANSISTOR, PULLUP, CROSSING, LINE, JOG, EMPTY, and NULLCELL.
These  correspond  to  the  system  cells  described  in  the  previous  section. 

The  only  place  where  the  user  has  to  specify  wires  is  for  system 
cells,  where  he  gives  the  wires  at  left,  top,  right,  and  bottom  of  the  cell. 
The functions metal(width), poly(width), diff(width), nowire, and the
more general function wire(layer, width), allow the specification of a wire
(metal(width) is just a shorthand for wire(METAL, width), for instance).
Here is one place where a lot of parameterization can be done: layer and
width can be parameterized; also, a wire can be a variable. It is also
possible to have more than one wire at one side of the cell, allowing for
overlapping wires or more complex cells.

The operators to be applied to the cells can be: LEFT, RIGHT,
ABOVE, BELOW, ROTATED0, ROTATED90, ROTATED180, ROTATED270,
FLIPPED0, FLIPPED90, FLIPPED45, FLIPPED135.

The  procedure  extcell  specifies  a  filename  containing  an  external 
cell.  The  external  cell  can  be  in  intermediate  form,  in  which  case  we  call 
it flexible, or it can be in CIF, in which case we call it rigid. One special
kind of external cell is the pad, which is a rigid cell. There is a pad
library.
The procedures begincell and endcell are used to delimit a composite
cell. The parameter of begincell is a character string containing the
name  of  the  cell;  the  name  may  be  blanks  only,  in  case  we  don't  want  to 
name  the  cell.  The  cell  name  is  used  to  trace  errors.  If  the  cell  is  named 
it  will  correspond  to  a  symbol  in  the  CIF  code,  thus  preserving  the  cell 
hierarchy,  useful  for  programs  that  display  CIF. 

The  parameter  of  endcell  is  a  string  containing  the  names  of  the 
parameter  wires  (usually  only  blanks).  These  names  will  appear  in  the  CIF 
code,  and  they  are  useful  for  simulation. 

Fig. 10 describes a nand cell, and gives some examples of
parameterization. Fig. 11 describes a binary tree that uses the nand flexible cell
generated  previously. 


program nand(output);
const
#include "/va/allende/usr/const.h"
type
#include "/va/allende/usr/type.h"

var power, ground, p2, d2: wiretype;

#include "/va/allende/usr/proc.h"

procedure contact(w1, w2, w3, w4: wiretype);
begin syscell(CONTACT, w1, w2, w3, w4, 0); end;

procedure crossing(w1, w2: wiretype);
begin syscell(CROSSING, w1, w2, w1, w2, 0); end;

procedure above;
begin place(ABOVE); end;

procedure nand;
begin
    begincell('nand');
        begincell('column1');
            syscell(LINE, power, nowire, power, nowire, 0);    above;
            contact(nowire, nowire, p2, p2);                   above;
            crossing(ground, p2);
        endcell(' ');
        place(LEFT);
        begincell('column2');
            contact(power, nowire, power, d2);                 above;
            syscell(PULLUP, nowire, d2, p2, d2, 4);            above;
            syscell(TRANSISTOR, nowire, d2, p2, d2, 0);        above;
            syscell(TRANSISTOR, p2, d2, nowire, d2, 0);        above;
            contact(ground, d2, ground, nowire);
        endcell(' ');
        place(LEFT);
        begincell('column3');
            crossing(power, p2);                               above;
            contact(p2, p2, nowire, nowire);                   above;
            contact(p2, nowire, nowire, p2);                   above;
            crossing(ground, p2);
        endcell(' ');
    endcell(' ');
end;

begin
    power := wire(METAL, 5);
    ground := wire(METAL, 5);
    p2 := wire(POLY, 2);
    d2 := wire(DIFF, 2);
    nand;
end.


Fig. 10 - Nand cell
program binarytree(output);
const
#include "/va/allende/usr/const.h"
type
#include "/va/allende/usr/type.h"
#include "/va/allende/usr/proc.h"

procedure root;
begin
    extcell('nand.pif');
end;

procedure btree(n: integer);
begin
    begincell('btree');
    if n = 1 then root
    else begin
        root;
        place(ABOVE);
        begincell(' ');
        btree(n-1);
        place(LEFT);
        btree(n-1);
        endcell(' ');
    end {if};
    endcell(' ');
end;

begin
    btree(4);
end.


Fig. 11 - Binary tree
Parameterization can be done extensively, and it simplifies the
modification of layouts. For example, a wire can be fully parameterized,
as "power" in fig. 10. Parameters in the program not related to the
ALLENDE  basic  procedures  can  also  be  used  to  produce  general  cells;  one 
example  is  the  depth  of  the  tree  in  fig.  11. 
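To see how a single parameter such as the tree depth drives the size of the generated layout, the recursion of fig. 11 can be transcribed into C. This is a sketch under stated assumptions: the four procedures below are hypothetical stand-ins that only count calls, not the ALLENDE library routines (which would emit PIF), and the counters `cells_opened`, `cells_closed`, and `nand_cells` are introduced purely for illustration.

```c
#include <assert.h>

enum op { LEFT, ABOVE };

/* Call counters; the stubs below stand in for the real ALLENDE
 * procedures of the same names and do nothing but count. */
static int cells_opened, cells_closed, nand_cells;

static void begincell(const char *name)  { (void)name;  cells_opened++; }
static void endcell(const char *wires)   { (void)wires; cells_closed++; }
static void place(enum op o)             { (void)o; }
static void extcell(const char *file)    { (void)file;  nand_cells++; }

/* The root of every subtree is one instance of the flexible nand cell. */
static void root(void) { extcell("nand.pif"); }

/* Same structure as procedure btree in fig. 11: a root placed above
 * two subtrees of depth n-1 set side by side. */
void btree(int n)
{
    assert(n >= 1);
    begincell("btree");
    if (n == 1) {
        root();
    } else {
        root();
        place(ABOVE);
        begincell(" ");
        btree(n - 1);
        place(LEFT);
        btree(n - 1);
        endcell(" ");
    }
    endcell(" ");
}
```

A tree of depth n instantiates 2^n - 1 nand cells, so changing the single argument of `btree` changes the whole layout, and every `begincell` is matched by an `endcell`.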

Errors may occur during compilation of the user program, execution,
interpretation of the PIF program, and solving the constraints.
Compilation and execution errors are the usual ones, detected by the
Pascal or C compiler or during execution. In case of error during
interpretation of the intermediate form describing the layout, the PIF
interpreter identifies the cell where the error occurred. The only possible
error during the process of solving the constraints is a rigid cell not
fitting the context where it is instantiated; in this case, the solver points
to the cell that caused the error.


2.5. The Complete ALLENDE System

The  ALLENDE  layout  system,  as  it  stands  now,  is  comprised  of  four 
programs: 

- ALLENDE

It takes a program and generates the layout (rigid cell), the circuit
at the switch level, or a flexible cell to be used later on.

- SIMULATE

This program is a switch-level simulator, based on [13].

- CIF2CELL

The idea of CIF2CELL is to make possible the use of CIF code
generated by other tools. It basically finds the cell interface. The CIF
code is then used as a rigid cell.

- CIF2CIRCUIT

When rigid cells are used, it is not possible to extract the circuit
from the "high-level" description of the layout. In this case, the
circuit is extracted from the layout in CIF. Our program interfaces to
a circuit extractor, and currently we are using the Berkeley circuit
extractor mextra [11].
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Fig.  12  -  The  ALIUNDE  system 

The  programs  composing  the  ALLENDE  system  were  written  in  C 
and  Pascal.  The  system  runs  under  the  Berkeley  UNIX  operating  system, 
and  is  currently  used  on  a  VAX  750. 

3.  Advantages  of  our  approach 

By making layout design similar to software design we can apply
our knowledge about programming to this new activity. The main issues
associated with the use of a procedural language for layout description are
the following.

—  hierarchical  design 

Hierarchy  already  exists  in  programming  languages,  in  the  form  of 
procedures,  and  the  programmer  is  used  to  it.  Use  of  this  hierarchy 
for  layout  design  enforces  a  good  design  methodology. 


—  expressive  power 

All the power of a procedural language is available to the designer.
Parameters, conditional statements, iterative statements, etc., can
be used. Parts of the layout, such as a PLA or a routing cell, can be
generated by a program instead of being described by the designer.

—  parameterization 

Having a layout design which produces different layouts for
different values of a set of parameters is extremely useful. Examples
are the parameterization of the layer or width of a wire,
transistor ratios, or size of shift registers. This parameterization
can involve local or global changes in the layout, and it simplifies
the modification of layouts. It also allows general cells, whose
characteristics depend on the values of some parameters (like a
routing cell).

—  documentation 

If the layout is described using a programming language we have
some documentation on the design. This helps other people, and
even the original designer, to understand the design.

—  open  ended  tool 

Graphics  editors  tend  to  be  closed  tools  in  the  sense  that  it  is  hard 
to  automate  the  layout  process  beyond  what  the  original  design  of 
the  system  allowed.  Procedural  languages  are  much  better  in  this 
respect.  The  input  to  a  compiler  is  text  that  can  be  generated  by 
humans  or  by  a  program,  while  a  graphics  editor  has  an  interactive 
nature,  being  designed  basically  to  accept  commands  generated  by 
humans. 

—  no  expensive  equipment 

With a programming language for layout description we can avoid
the need for sophisticated computing resources. A standard
alphanumerical terminal in combination with a small plotter or CRT
display shared by several designers can be used effectively for
layout design.

Much is gained by not assigning absolute positions to the layout
elements directly, but by representing the relations among elements by a
set of constraints. Implicit layout rules and cell flexibility are the main
benefits of representing the layout as constraints. The design rules are
implicit in the constraint generation process. This design-rule-free
environment relieves the designer of details that can cloud more global
and important issues. What is more important, the layout obtained can be
guaranteed to be free of design rule violations, thus eliminating the need
for design rule checking.

If a piece of a layout is specified in absolute positions, serious
problems are likely to arise when different pieces are put together. In
constraint-based layout systems the absolute sizes or positions are
determined by the system after solving the set of constraints. This makes
cells flexible, with the possibility of being stretched in order to combine
correctly with other cells.
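
To make the idea of positions determined by constraints concrete, the following is a minimal sketch in Python, not the ALLENDE solver itself (whose algorithm is not described in this section). It assumes one-dimensional minimum-separation constraints of the form x[b] - x[a] >= d over an acyclic set of elements and assigns each element the smallest coordinate that satisfies them. The element names and the function `solve_positions` are hypothetical.

```python
from collections import defaultdict, deque


def solve_positions(elements, constraints):
    """Return the smallest non-negative coordinates satisfying every
    constraint (a, b, d), read as x[b] - x[a] >= d.

    Assumption: the constraint graph is acyclic (a cycle raises ValueError).
    """
    successors = defaultdict(list)
    indegree = {e: 0 for e in elements}
    for a, b, d in constraints:
        successors[a].append((b, d))
        indegree[b] += 1

    x = {e: 0 for e in elements}
    ready = deque(e for e in elements if indegree[e] == 0)
    processed = 0
    while ready:
        a = ready.popleft()
        processed += 1
        for b, d in successors[a]:
            # Longest-path relaxation: b must sit at least d to the right of a.
            x[b] = max(x[b], x[a] + d)
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    if processed != len(elements):
        raise ValueError("constraints contain a cycle")
    return x


# Hypothetical example: two wires between the edges of a cell.
positions = solve_positions(
    ["left_edge", "wire_a", "wire_b", "right_edge"],
    [
        ("left_edge", "wire_a", 2),
        ("wire_a", "wire_b", 3),
        ("left_edge", "wire_b", 4),
        ("wire_b", "right_edge", 2),
    ],
)
print(positions)
```

Stretching a cell to fit a neighbour then amounts to adding one more constraint and solving again; no absolute coordinate appears in the description itself.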

4.  Conclusions 

The ALLENDE layout system has been used by a number of people in
the design of chips which were successfully fabricated, and in
experiments with layout tools [2]. Some layout tools, like a PLA generator and a
pad router, have been written in ALLENDE with little effort.

One important aspect of a system, seen only when you use it, is
detection and location of errors. In ALLENDE, the layouts produced are
correct by definition, in terms of connectivity and design rules. In case of
errors in the user specification of the layout, the system points to the cells
and wires involved in the error.

As far as compaction is concerned, the layouts produced by the
system are relatively dense. It is hard to make a comparison of layout
density for layouts produced by ALLENDE and those produced by hand,
because that depends on the regularity of the layout and on the expertise
of the designer. Based on our experiments, we find that for regular
structures we obtain something close to the same density, while for irregular
structures we use about 20 percent more area than the corresponding
hand-packed layout.

The structured representation of the layout and the use of an
intermediate language (PIF) have led to a very efficient system, in terms of
space and execution time, and a straightforward system implementation.

The ALLENDE layout system is based on nMOS technology, because this
system was an experiment and nMOS technology is well understood.
There are plans to extend the language to CMOS technology, and also
to allow more layers, like a second metal layer. Besides that, we intend to
investigate its applicability in the design of printed circuit boards.

A graphics editor can be easily incorporated into the ALLENDE
system. The main characteristics of this editor, compared to other layout
editors, would be:


—  the  objects  dealt  with  by  the  designer  are  cells,  and  not  shapes; 

—  the  objects  are  composed  in  a  structured  way; 

—  the  designer  only  specifies  relative  positions;  the  absolute  positions 
are determined by the system, taking into account the design rules;

—  the  obtained  layout  is  free  of  design  rule  violations;  no  checking  is 
necessary; 

—  connectivity is also guaranteed, and the circuit can be directly
obtained.

In an ALLENDE program the circuit structure and the layout
structure overlap, that is, the user describes at the same time both the circuit
structure and the layout structure. The circuit structure goes down to
the level of transistors and contacts. We could have a language allowing
the specification of the circuit structure at a higher level (at the gate
level or at the functional level, for example). From this specification the
layout in PIF would be generated. Generalizing, PIF (or ALLENDE) could
be the target language for a VLSI design tool, even a silicon compiler.

Besides being a powerful tool, ALLENDE is associated with a
structured methodology for VLSI design. Having a tool that enforces hierarchy
and the use of regular structures is going to improve the way we design
integrated circuits. That is one step in the direction of managing the VLSI
design complexity.
Abstract

This paper describes a method for local optimization of VLSI leaf cells, using
the parameterized procedural layout language ALLENDE [5]. Tradeoffs among
delay time, power consumption, and area are illustrated. Three different
implementations of the 1-bit full adder are compared: a random logic circuit, a data
selector, and a PLA. The fastest random logic 1-bit full adder has a time-power
product about 1/3 that of the fastest data selector, and about 1/4 that of the
fastest PLA. The 4-bit parallel adder is used to illustrate the effect of loading
when leaf cells are combined.

1.  Introduction 

In the design of custom VLSI chips it often happens that there is one cell
that is used many times, usually in an array or a recursive structure. The fact
that a cell is used many times means that there is a large potential payoff in its
optimization, and that the problem can be made small enough to be
manageable. Arrays of cells are especially common in digital signal processing
applications, where regular structures, like systolic arrays, lead to designs that are
easy to lay out efficiently, and have high throughput. As examples, bit-parallel
and bit-serial multipliers can be constructed from one- and two-dimensional
arrays of one-bit full adders, as can a wide variety of pipelined FIR and IIR filters
(see [1], for example). As another example, a processor for updating
one-dimensional cellular automata has been designed at Princeton which consists of
a one-dimensional array of 5-input/1-output PLAs [10]. In such cases the
problem of making most efficient use of a given piece of silicon breaks down into two
distinct problems: 1) choice of the global packing strategy (the method of laying
out and interconnecting leaf cells, and connecting them to power and clocks),
and 2) the design of the iterated structure itself (which we call the leaf cell). In
this paper we study the second problem: the design of efficient leaf cells. The
example used throughout is the most common in digital signal processing, the
1-bit full adder.

There are three important measures of how good a leaf cell is: its time
delay T, its peak or average power dissipation $P_{max}$ or $P_{ave}$, and its area A.
Ideally, the designer should be able to trade off these measures, one against the
other. For example, in one application the clock may be fixed at a known value
$T_0$, and it would therefore be senseless to make the cell faster. On the other
hand, peak power may be a real constraint because of heat dissipation
limitations, and at the same time it may be important to keep the area small so as to
fit as many cells on one chip as possible. We might therefore try to minimize
some measure of the peak power and area (the product, for example), while
enforcing the constraint $T \le T_0$. In other applications speed may be critical, and
it may be important to minimize T while observing constraints on Pp and A, and
so on. In general, we would like to have enough information about the tradeoffs
among the measures T, P and A to make intelligent design decisions. As we will
see, the P-T tradeoff is often of most interest, since the area is often a less
sensitive function of design parameters (at least for fixed topology).

This work was supported by National Science Foundation Grant ECS-8307955, the U.S.
Army Research Office, Durham, NC, under Grant DAAG29-82-K-0095, DARPA Contract
N00014-82-K-0549, and ONR Grant N00014-83-K-0275.

2.  Formulation 

The basic approach we take will be to search for local improvements on
random initial designs. The search strategy will be to consider all single or double
changes in element size along the critical path. When only single changes are
tried, we call the procedure "1-change"; when double changes are tried,
"2-change". The idea is that the critical path indicates which parameters are most
important to performance at any given point in the analysis.

We  will  limit  the  optimization  to  choice  of  pulldown  widths.  The  method  can 
be  extended  to  choice  of  layers,  orientation,  and  topologies.  We  will,  however, 
study  three  radically  different  topologies  for  the  full  adder:  the  PLA,  data- 
selector,  and  random  logic. 

The main analysis tools used in these experiments are the timing simulator
CRYSTAL, and the power-estimation program POWEST, together with the rest of
the Berkeley tool package [2].

Another essential component of the work is a procedural, constraint-based
layout language for specifying VLSI layouts; in this case, we used the new
language ALLENDE being developed at Princeton, a successor to ALI2 and CLAY
[3,4,5]. This allows us to specify circuit parameters and have a cifplot generated
automatically.

3.  The  Critical-Path  Optimization 

Figure 1 shows how the optimization is performed in our experiments. In
Figure 1, faparm is an input parameter vector to ANALYSIS which holds the diffusion
widths of nodes as described in section 4. The initial faparm is generated at
random by RANDOM according to its input file pattern. ANALYSIS takes faparm as
its input and generates an appropriate layout and its resulting T, P, and A, as
well as the nodes on the critical path (hereafter called the critical path nodes).
Since every node on the critical path has an associated parameter in faparm,
CASEGEN can generate faparms as subcases by using the one-(two-)change
method. Here the one-(two-)change method changes one (two) parameter(s)
associated with the critical path nodes by one step. (From here on the
1-change method is denoted by 1-opt or Random 1-opt, and the 2-change method
by 2-opt or Random 2-opt.)
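
The case generation described above can be sketched as follows. This is an illustrative Python sketch, not the authors' program: the step size of 1, the lower bound `lo=2` on a width, and the function names `one_change` and `two_change` are assumptions made for the example. `critical` holds the indices of the faparm entries associated with the critical path nodes.

```python
from itertools import combinations, product


def one_change(faparm, critical, step=1, lo=2):
    """All vectors differing from faparm in exactly one critical-path
    parameter, moved up or down by one step (never below lo)."""
    cases = []
    for i in critical:
        for delta in (-step, step):
            if faparm[i] + delta >= lo:
                case = list(faparm)
                case[i] += delta
                cases.append(tuple(case))
    return cases


def two_change(faparm, critical, step=1, lo=2):
    """All vectors differing from faparm in exactly two critical-path
    parameters, each moved up or down by one step (never below lo)."""
    cases = []
    for i, j in combinations(critical, 2):
        for di, dj in product((-step, step), repeat=2):
            if faparm[i] + di >= lo and faparm[j] + dj >= lo:
                case = list(faparm)
                case[i] += di
                case[j] += dj
                cases.append(tuple(case))
    return cases


# Hypothetical 4-parameter vector whose last two entries lie on the critical path.
singles = one_change((16, 12, 3, 2), critical=[2, 3])
doubles = two_change((16, 12, 3, 2), critical=[2, 3])
print(singles)
print(doubles)
```

With c critical-path parameters the 1-change neighbourhood has at most 2c cases and the 2-change neighbourhood at most 4·c(c-1)/2, which is why the 2-change method costs more analyses per iteration.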

The optimization strategy is shown in the flowchart of figure 2. When the
first improvement occurs, this case is picked up for the next iteration. If no
improvement occurs but there exists a case which has the same cost and has
not yet been analyzed, this case is adopted next. Otherwise a new random
faparm is generated for the next iteration, to search for other locally optimal
points. We used two cost criteria for optimization: T, and $P_{max} T$ (hereafter
denoted by PT). Figure 3 shows an outline of the main procedures used in the
ANALYSIS loop. A short description of each follows below:

1)  ALLENDE   This procedural constraint-based VLSI layout language
              produces an integrated circuit layout in Caltech
              Intermediate Form (CIF) corresponding to the specified
              parameters [5].

2)  MEXTRA    MEXTRA reads CIF and extracts the nodes to create a circuit
              description for further analyses [2].

3)  CRYSTAL   CRYSTAL is used for finding the worst-case delay time of the
              circuit [2].

4)  POWEST    POWEST is used for finding the average and maximum power
              consumption of the circuit.

5)  CRITICAL  CRITICAL reports the critical path nodes by using the output
              of CRYSTAL.

6)  LIST      This command stores the vector of results (T, P, A) in the
              HISTORY file for further optimization.

In figure 3 the squares surrounded by dotted lines are files used for inputs
or outputs of the above procedures.

1)  faparm
        The faparm has parameters for layout generation; for example,
        the diffusion width of each node, the permutation of product
        terms in a PLA, etc.

2)  layout generating program
        There are several ALLENDE programs implementing desired circuit
        topologies such as the PLA, random logic, etc. Each program
        requires parameters in its corresponding faparm.

3)  the critical path nodes
        The critical path nodes are extracted from the output of
        CRYSTAL. Each node can be associated with parameters in faparm.
        This is done by looking up a table for each topology, which
        associates each node with its corresponding parameter.
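
The search loop of this section (take the first improving case, otherwise move to an equal-cost case not seen before, otherwise restart from a new random vector) can be sketched in Python as below. This is a sketch under stated assumptions, not the flowchart of figure 2: the cost function `toy_cost` stands in for the ALLENDE, MEXTRA, CRYSTAL and POWEST analysis chain, "not yet analyzed" is approximated by "not yet visited", and all function names are hypothetical.

```python
import random


def local_search(analysis, neighbors, random_start, restarts=5, seed=0):
    """First-improvement local search with random restarts.
    Returns the best (parameter vector, cost) pair found."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        current = random_start(rng)
        cost = analysis(current)
        visited = {current}
        while True:
            chosen = None
            for case in neighbors(current):
                c = analysis(case)
                if c < cost:
                    chosen = (case, c)  # first improvement is taken at once
                    break
                if c == cost and case not in visited and chosen is None:
                    chosen = (case, c)  # equal-cost fallback, used only if nothing improves
            if chosen is None:
                break  # locally optimal point reached
            current, cost = chosen
            visited.add(current)
        if best is None or cost < best[1]:
            best = (current, cost)
    return best


# Toy stand-ins (assumptions): two widths between 2 and 16, quadratic cost.
def toy_cost(p):
    return (p[0] - 6) ** 2 + (p[1] - 8) ** 2


def toy_neighbors(p):
    out = []
    for i in range(len(p)):
        for delta in (-1, 1):
            v = p[i] + delta
            if 2 <= v <= 16:
                out.append(p[:i] + (v,) + p[i + 1:])
    return out


def toy_start(rng):
    return (rng.randint(2, 16), rng.randint(2, 16))


result = local_search(toy_cost, toy_neighbors, toy_start)
print(result)
```

In the real flow each call to `analysis` is expensive (a layout generation plus extraction and simulation), so the order in which neighbouring cases are tried matters far more than it does in this toy.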

4.  Full-Adder  Circuit  Implementations 

As mentioned in the Introduction, we adopted the 1-bit full-adder circuit as
an example for experimentation, because it is relatively simple, but is a basic
arithmetic logic circuit. The 1-bit full-adder circuit can be implemented in
many ways. We chose three kinds of circuits: the PLA, Data Selector, and
Random Logic. Each layout has several parameters. We will use the vector
representation of these parameters; that is, $d = (d_1, d_2, \ldots, d_n)$ means that the
diffusion width of node i is $d_i \lambda$. We also use the vector $k = (k_1, k_2, \ldots, k_n)$ to
mean that the pullup to pulldown ratio of the inverter, NOR or NAND circuit in
which node i exists is $k_i$. The vector k is fixed for each circuit.

1)  PLA

Figure 4 shows the full-adder circuit diagram implemented by a programmable
logic array (PLA) [7]. This layout has the following 17 parameters and 2
permutations.

$d = (d_{and1}, \ldots, d_{and7}, d_{or1}, d_{or2}, d_{in1}, \ldots)$

k = (4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)

-  7 pulldown diffusion widths of the AND plane.

-  2 pulldown diffusion widths of the OR plane.

-  6 pulldown diffusion widths for inputs.

-  2 pulldown diffusion widths for outputs.

-  1 permutation of product terms in the AND plane.

-  1 permutation of outputs.

In the optimization process, the two permutations are fixed for the sake of
simplicity. However, those two permutations are chosen in advance, by experiments
on various random permutations as inputs, so as to give the best result
before the optimization.


2)  Berkeley PLA

The PLA generated by using mkpla of the Berkeley VLSI tools [2,8] is used for
the purpose of cost comparison with the PLA implemented in 1). This PLA is not
optimized, but uses the following fixed parameter vector.

d = (4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8)

k = (4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)

3)  Data Selector

Figure 5 shows the full-adder circuit diagram of a Data Selector implementation
[9]. The following truth table is used.

    $C_i$   B   |   S           $C_o$
    ------------+---------------------------
    0       0   |   A           $C_i$ (or B)
    0       1   |   $\bar{A}$   A
    1       0   |   $\bar{A}$   A
    1       1   |   A           $C_i$ (or B)

This circuit selects inputs (A, $\bar{A}$, or $C_i$) instead of calculating S and $C_o$. Here $C_i$
is the input carry signal, $C_o$ is the output carry signal, and S is the output sum
signal. A and B denote the two other inputs. This layout has the following 8
parameters.

$d = (d_A, d_B, d_{Ci}, d_1, d_2, d_3, d_{Co}, d_S)$

k = (4, 4, 4, 4, 8, 4, 8, 8)

-  3 pulldown diffusion widths for input inverters.

-  3 pulldown diffusion widths for internal inverters.

-  2 pulldown diffusion widths for output inverters.
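
The selection scheme can be checked exhaustively. The short Python sketch below is only a logic-level check, not a model of the circuit in Figure 5; it assumes that the sum output is A when the carry input equals B and the complement of A otherwise, and that the carry output is the carry input when it equals B and A otherwise. The function name `data_selector_adder` is hypothetical.

```python
from itertools import product


def data_selector_adder(a, b, ci):
    """Full adder built by selection: the pair (ci, b) decides which signal
    is routed to each output; nothing is computed arithmetically."""
    not_a = 1 - a
    s = a if ci == b else not_a   # rows (0,0) and (1,1) pass A, the others pass not-A
    co = ci if ci == b else a     # rows (0,0) and (1,1) pass Ci, the others pass A
    return s, co


# Compare against ordinary binary addition for all eight input combinations.
rows = []
for a, b, ci in product((0, 1), repeat=3):
    s, co = data_selector_adder(a, b, ci)
    rows.append((a, b, ci, s, co))
    assert 2 * co + s == a + b + ci
print(rows)
```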

4)  Random Logic

Figure 6 shows the circuit diagram of the Random Logic implementation [6].
This layout has the following 4 parameters.

$d = (d_1, d_2, d_{Co}, d_S)$

k = (8, 12, 4, 4)

-  2 pulldown diffusion widths for internal inverters.

-  2 pulldown diffusion widths for output inverters.

All the circuits above were verified by ESIM [2] or SIMULATE [5].

5.  Parameterization 

The diffusion width of the pullup in each stage is automatically determined
and implemented by ALLENDE in the following way. Suppose that the current
parameter vector is $d = (d_1, d_2, \ldots, d_n)$, and the pullup-to-pulldown ratio
vector of the specified layout is $k = (k_1, k_2, \ldots, k_n)$. (The choice of
pullup-to-pulldown ratio is discussed in [7].) For each node i, define the variables
$Z_{pu}$, $Z_{pd}$ and a pullup-to-pulldown ratio K as follows.

$$Z_{pu} = \frac{L_{pu}}{W_{pu}}, \qquad Z_{pd} = \frac{L_{pd}}{W_{pd}}, \qquad K = \frac{Z_{pu}}{Z_{pd}}$$

where

$L_{pu}$ ($L_{pd}$) is the length of pullup (pulldown).
$W_{pu}$ ($W_{pd}$) is the width of pullup (pulldown).
$W_{pd} = d_i$, $K = k_i$ and $L_{pd} = 2$.

$L_{pu}$ and $W_{pu}$ are determined as follows.

If $W_{pd} \le 2K$:

$$W_{pu} = 2, \qquad K = \frac{L_{pu}/2}{2/W_{pd}} \quad\text{or}\quad L_{pu} = \frac{4K}{W_{pd}}$$

If $W_{pd} > 2K$:

$$W_{pu} = \frac{W_{pd}}{K}, \qquad K = \frac{L_{pu}/W_{pu}}{2/W_{pd}} \quad\text{or}\quad L_{pu} = \frac{2K\,W_{pu}}{W_{pd}}$$
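
As a worked illustration of the sizing rule, the following Python sketch computes the pullup width and length from a pulldown width and a target ratio. It rests on a reading of the rule that is partly an assumption, since the printed fractions are hard to read: the pulldown length is fixed at 2, the pullup width is held at the minimum 2 while the pulldown width is at most 2K (the pullup length is then 4K divided by the pulldown width), and beyond that the pullup is widened to the pulldown width divided by K, with length 2. All dimensions are in units of lambda, and the function names are hypothetical.

```python
def pullup_size(w_pd, k):
    """Return (W_pu, L_pu) for pulldown width w_pd and pullup-to-pulldown
    ratio k, assuming a pulldown length of 2 and a minimum dimension of 2."""
    if w_pd <= 2 * k:
        w_pu = 2.0               # narrow pulldown: keep the pullup at minimum width
        l_pu = 4.0 * k / w_pd    # and lengthen it to reach the ratio
    else:
        w_pu = w_pd / k          # wide pulldown: widen the pullup instead
        l_pu = 2.0               # which leaves it at minimum length
    return w_pu, l_pu


def achieved_ratio(w_pd, w_pu, l_pu, l_pd=2.0):
    """K = Z_pu / Z_pd with Z = L / W."""
    return (l_pu / w_pu) / (l_pd / w_pd)


for w_pd, k in [(2, 4), (8, 4), (24, 8)]:
    w_pu, l_pu = pullup_size(w_pd, k)
    print(w_pd, k, w_pu, l_pu, achieved_ratio(w_pd, w_pu, l_pu))
```

Under this reading the two cases meet at a pulldown width of 2K, where both pullup dimensions equal 2, and the achieved ratio equals K in either case.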


We adopted the following choices.

1)  $\lambda = 2\,\mu m$

2)  The timing estimation program CRYSTAL uses an input pulse which is 1 nsec
wide.


6.  Results 

Table 1 shows a comparison of the performance of our implementations.
Each row represents one locally optimal point using as criterion the item
indicated by *. The units of A, $P_{ave}$, $P_{max}$, T, APT and PT are $\lambda^2$, ($10^{-8} \cdot W$),
($10^{-8} \cdot W$), ns, ($10^{-12} \cdot \lambda^2 \cdot W \cdot ns$) and ($10^{-8} \cdot W \cdot ns$) respectively in all
tables. Figure 7 shows $P_{max}$ vs T curves for different topologies, while figure 8
shows several $P_{max}$ vs T trajectories obtained during the process of optimization
using the 1-change and 2-change methods for the Data Selector and the Random


Table 1. Performance comparison (1-bit full adder)

type            A      Pave   Pmax    T      APT   PT     parameter
--------------  -----  -----  -----   -----  ----  -----  ----------------
PLA             21560  6472   10183   12.8*  2802  1303   1)
                21840  5678   9241    15.3*  3087  1413   2)
                21762  5503   8616    14.9*  2794  1284   3)
PLA(Berkeley)   22176  7314   11749   12.8*  3339  1504   4)
Data Selector   8100   3765   6317    15.8*  783   966    8 8 8 8 8 8 8 8
                8100   3529   5645    16.5*  754   931    8 8 8 4 8 8 8 8
                8190   3764   6116    15.9*  796   972    12 8 8 8 8 8 8 8
Random Logic    7742   1331   1957    16.5*  392   323    16 12 3 2
                9600   1683   2427    16.4*  382   398    16 24 2 3
                9800   1644   2329    16.4*  378   382    16 24 2 2
                9600   1723   2506    16.5*  397   413    16 24 3 3
                5194   705    1096    22.6   128   248*   6 8 2 2
                4704   626    1018    25.9   124   264*   4 6 3 2
                5136   744    1174    22.9   138   269*   6 8 2 3

1)  d = (4, 4, 4, 4, 4, 4, 3, 4, 4, 8, 8, 8, 4, 4, 4, 8, 2)

2)  d = (4, 2, 3, 3, 3, 3, 3, 4, 3, 8, 8, 8, 4, 4, 4, 8, 2)

3)  d = (3, 3, 3, 4, 4, 4, 4, 3, 3, 9, 8, 8, 4, 4, 4, 4, 3)

4)  d = (4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8)


Table 2. Performance comparison (4-bit parallel adder)

type            A      Pave   Pmax   T       APT      PT      parameter
--------------  -----  -----  -----  ------  -------  ------  ----------------------
Data Selector   41310  16536  28218  75.3*   877761   212482  4 8 8 8 16 8 16 16
                44550  16536  28218  84.1*   1057230  237313  4 8 8 8 16 8 24 16
                45409  16534  28213  84.3*   1079990  287836  4 8 8 16 16 16 16 16
                44523  13248  21641  91.0*   876805   196933  4 8 8 8 16 4 8 4
                42845  12301  19748  92.5*   782645   182669  4 8 4 4 16 8 8 4
                43747  11362  17868  94.9*   741806   169567  4 8 4 4 16 4 8 4
                43605  12354  20692  98.0*   884229   202782  2 8 4 8 16 8 4 8
                45441  11885  19753  100.8*  904777   199110  2 8 4 8 16 8 4 4
                44523  12305  19755  101.1*  889227   199723  4 8 4 8 16 4 4 8
                44649  11831  18808  103.2*  866631   194099  4 8 4 4 16 8 4 4
                43747  11362  17868  103.6*  809812   185112  4 8 4 4 16 4 4 8
Random Logic    35552  6577   10335  41.1*   151014   42476   16 12 8 2
                34848  6734   10649  41.4*   153634   44087   16 12 8 3
Logic circuit. Each point takes about 1.5 minutes of cpu time on a VAX 11/750.
Many of the locally optimal solutions have identical parameter values on the
critical path, but differ in other coordinates because of different random
starting values.

7.  Parallel Adder: The effect of loading factors

The preceding results did not take the loading on the output of the circuit
into account. When these circuits are used in arrays, this may become
important. To study this problem, we implemented two circuits for a 4-bit parallel
adder, using the Data Selector and the Random Logic 1-bit full adders of the
previous section. The results are shown in Table 2.

8.  Discussion of Results

8.1.  $P_{max}$ vs T tradeoff

Figure 8 shows $P_{max}$-T trajectories followed by the critical path
optimization process, when minimizing T for the Random Logic circuit. The dotted
envelope shows the final tradeoff curve for P vs T. Notice that the locally
optimal point obtained by using PT as the cost criterion lies very close to the
trajectory obtained when minimizing T. (See point a, with P = 12.5 mW, and
T = 22.4 ns.) For comparison, the optimization for PT gave us a locally optimal
point b with P = 10.9 mW and T = 22.6 ns, very close to point a. Thus,
optimization using the two criteria is consistent.

8.2.  Performance comparison among the PLA, Data Selector, and Random Logic

Table 3. Normalized performance comparison (1-bit full adder)

type            A    Pave  Pmax  T    APT  PT
--------------  ---  ----  ----  ---  ---  ---
Random Logic    100  100   100   100  100  100
Data Selector   105  283   313   96   200  299
PLA             278  486   520   78   715  403
PLA(Berkeley)   286  550   600   78   852  466

Table 3 shows a normalized performance comparison of the best locally
optimal point for each layout, minimizing T. The Random Logic seems to be the
best choice in all respects except T. However, it is the fastest among the 4-bit
parallel adder implementations. The T of the 4-bit parallel adder using Random
Logic is less than 4 times the T of the 1-bit full adder, while in the other layouts
it is more than 4 times the T of the 1-bit full adder. The reason is that this
Random Logic 1-bit full adder circuit calculates the carry signal and propagates it
before the calculation of the sum signal, so the carry ripple propagates faster
than the sum. As a result, the 4-bit parallel adder takes only 2.5 times as much
time as the 1-bit full adder. Figure 7 shows the P-T tradeoff curve of each
layout. The curve for the Random Logic circuit is below the one for the Data
Selector, which is below that for the PLA. Hence we can order the layouts with
Random Logic best, Data Selector next, and PLA last. This result agrees with our
intuition because this order is the same as the order of circuit specialization.
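
The delay ratios quoted in this paragraph can be reproduced from the best delay times listed for each topology in Tables 1 and 2. The few lines of Python below are only that arithmetic; the dictionary names are hypothetical, and the values are the fastest starred T entries of the two tables, in ns.

```python
# Best (smallest) T minimized in Table 1 (1-bit) and Table 2 (4-bit), in ns.
one_bit = {"Random Logic": 16.4, "Data Selector": 15.8}
four_bit = {"Random Logic": 41.1, "Data Selector": 75.3}

# Ratio of 4-bit parallel adder delay to 1-bit full adder delay.
ratios = {name: four_bit[name] / one_bit[name] for name in one_bit}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}")
```

A ratio below 4 means that the four stages do not each contribute a full 1-bit delay, which is what the early carry calculation of the Random Logic circuit buys.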


8.3.  Comparison between our PLA and the Berkeley PLA

Both PLAs have almost the same costs, except for P. The reason is that our
locally optimal point occurs at the choice
d = (4, 4, 4, 4, 4, 4, 3, 4, 4, 8, 8, 8, 4, 4, 4, 8, 2),
while the Berkeley PLA adopts
d = (4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8).
The Berkeley PLA is therefore very close to a local optimum with respect to T.

8.4.  Comparison with Myers' work

Myers did similar performance comparisons of various 1-bit full adder
implementations [9], but did not use any optimization. His results, shown in
Table 4 below, are quite different from ours, shown in Table 3. Our results show
that an appropriate choice of layout and its optimization makes the Random
Logic circuit better than the Data Selector, and that the PLA can be made very
fast at the expense of power.

Table 4. 1-bit full adder normalized performance comparison (Myers [9])

type            A    Pave  T    APT  PT
--------------  ---  ----  ---  ---  ----
Random Logic    100  100   100  100  100
Data Selector   45   50    125  28   72.5
PLA             105  110   170  196  187

8.5.  4-bit  Parallel  Adder 

Tables 1 and 2 show that the locally optimal point of the 1-bit full adder is
attained with a pulldown diffusion width of the carry output stage $d_{Co}$ = 2 or 3,
while the corresponding width for the 4-bit parallel adder is $d_{Co}$ = 8. The pullup
width remains 2. This suggests that the critical path passes through the
pulldown of the output carry stage, which is indeed the case.

On the other hand, for the Data Selector, the critical path passes through the
pullup of the output carry stage, and in fact it is the pullup width that expands
during optimization of the 4-bit parallel adder.

8.6.  Comparison  of  the  1-change  and  2-change  methods 

Figure  8  and  Table  5  show  a  comparison  between  the  1-change  and  the  2- 
change  methods  when  applied  to  the  Random  Logic  implementation.  Table  5  is 
discussed  in  the  next  section.  The  slope  of  the  2-change  method  is  steeper  than 
that  of  the  1-change  method,  but  the  2-change  method  reaches  better  locally 
optimal  points.  Hence  in  this  case  the  2-change  method  works  better  than  the 
1-change  method  does.  However,  the  2-change  method  does  not  work  as  well  as 
the  1-change  method  for  the  Data  Selector,  which  has  many  more  parameters. 
The  2-change  method  took  more  iterations  than  the  1-change  method  and  did 
not  obtain  better  locally  optimal  points. 

8.7.  Effectiveness  of  our  optimization:  Cost  Improvement  ratio 

Table 5 below shows the average initial delay times $T_0$ (obtained from
random starts), the average locally optimal delay time $T_{opt}$, the average percent
improvement of the delay time T, and the best locally optimal delay time $T_{best}$.
We can see from this that 2-opt performs much better than 1-opt. We should note
that it is very important to choose a good order in which to try improvements,
because this saves unnecessary search time evaluating changes that are
unlikely to be improvements. For example, we chose the diffusion widths of the
3-input NAND gate as the first parameters tried for the Random Logic circuit.


Table 5. Cost improvement of our optimization methods

type            opt    criterion  Tinitial  Topt  % improvement  Tbest
--------------  -----  ---------  --------  ----  -------------  -----
Random Logic    1-opt  T          29.7      19.2  33             19.1
Random Logic    2-opt  T          29.7      16.8  42             16.4
Data Selector   1-opt  T          24.3      17.7  25             15.8
Data Selector   2-opt  T          23.5      18.0  23             15.8
PLA             1-opt  T          19.3      16.3  16             12.8
ABSTRACT

A parallel processing architecture for the solution
of partial differential equations by point iteration is
studied. Grid points are stored in a circulating memory
and identical processors are spaced around the store.
Computer simulation of the solution of Laplace's
equation with a simple point iteration relaxation algorithm
for one-, two-, and three-dimensional problems shows
that convergence rates intermediate between those of
the Jacobi and Gauss-Seidel methods are obtained.
Hardware utilization efficiency (speedup relative to the
number of processors) of 40-60% is achieved with as
many as N processors, where N is the number of
nonboundary grid points. Furthermore, for up to N/2
processors, the efficiency remains above 90% in the
one-dimensional case, and above 75% in the two-dimensional
case. There are sharply diminishing returns for using
more than N/2 processors.

1.  Introduction 

The solution of partial differential equations taxes
the largest and fastest present-day general-purpose
computers. Physically meaningful problems often need
huge amounts of time and space. Clearly, with the
decreasing cost of large-scale integrated circuits, it
seems profitable to build special-purpose devices for
solving partial differential equations which use many
identical processors operating in parallel. This paper
describes a study of a circular arrangement of
processors and a circulating store. Simulation results for the
very simplest numerical problem are described: the
solution of Laplace's equation with Dirichlet boundary
conditions, using point iteration methods.

When explicit, point iterative methods are used to
solve partial differential equations, a grid point value is
updated by replacing it with some function of the values
at neighboring points in the grid. In the system
described here, the grid of points is mapped by a raster
scan into a circulating serial bit stream. The bit stream
passes through processors that update the grid points
as they pass through the processors. Using this
approach, problems with arbitrary numbers of
dimensions can be treated. In addition, identical processors
can be added without reorganization.

In the Gauss-Seidel method [8] the grid point
values are updated in an orderly, row-by-row and
plane-by-plane fashion, and new values are used as soon
as they become available. In the Jacobi method [8] old
grid point values are used throughout each iteration.
When the circulating store system uses one processor,
the calculation reduces to the Gauss-Seidel method;
when it uses one processor per grid point, it reduces to
the Jacobi method. When a number of processors
between these two extremes is used, there is a
complicated mixture of old and new values used by the
processors. The purpose of this paper is to investigate
experimentally the rate of convergence as a function of the
number of processors, and thereby to evaluate the
potential hardware utilization efficiency of the
circulating store system. The results show that the speed of
convergence is, as might be expected, intermediate
between the Gauss-Seidel and Jacobi methods.

2.  Circulating  Store  Configuration 

We study here a synchronous circuit consisting of a
long shift register arranged in a circle (the main
memory of the system), and a number of independent
processors tapping and updating the stream at various
points (see Fig. 1). Similar configurations have been
suggested by various workers at different times for
applications such as Monte Carlo calculations [1] and
image processing [2], as well as partial differential
equations [3,4,5]. A fixed network of microprocessors
which communicate locally has also been studied [6,7],
and these two arrangements are equivalent when there
is one processor per grid point.

If each grid point is mapped to a set of contiguous
bits in the stream, some bits in the set can represent
the value of the function at the grid point, while the
remaining bits can be used to flag boundary values,
store space-dependent coefficients, and possibly hold
other information. We will call the set of bits in the bit
stream corresponding to a grid point simply a "grid
point value." At any given time each processor must be
able to change the value of the grid point which it will
update, as well as read the values from those grid
points whose values are needed to do the update
calculation. For example, with a 5-point molecule in the
Gauss-Seidel method for a two-dimensional Laplace
equation, each processor must have information from 4
neighboring points, as shown in Fig. 2.

There is no direct communication between
processors. With each major clock cycle, the circular shift
register shifts one complete grid point, each processor
reads the necessary information, and if the grid point
currently associated with a processor is not a boundary
value, it is updated. Each grid point can have a bit
reserved to indicate convergence, based on the change
from the previous value of the function, and that bit can
be kept current every time the value is updated. A
counter can then be inserted in the circulating store to
detect the condition where all the points have
converged. Alternatively, a global counter can receive this
information from every processor every major clock
cycle. We will assume this latter method in the
simulation because it detects convergence sooner and gives
finer resolution in the measurement of running time,
but obviously it is not necessary for the operation of the
device.

Note  that  no  additional  time  is  required  to  observe 
the  changing  grid  point  values  on  a  graphics  screen, 
since  the  bit  stream  can  be  passed  serially  through 
such a display device without interfering with the
calculation.

We  emphasize  that  the  processors  work  in  parallel, 
and  so  the  answer  from  one  processor  is  not  available 
until  the  next  major  clock  cycle.  Thus,  as  was  pointed 
out  before,  the  values  used  in  any  one  calculation  are  in 
general  of  various  ages. 
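
The two limiting cases described above can be checked with a small simulation. The following is only a sketch under stated assumptions: it treats the one-dimensional problem, replaces each interior point by the average of its two neighbors, and spaces the processors equally around the store. The function names `clock_cycle` and `run` are illustrative and do not come from the original simulation program.

```python
def clock_cycle(grid, positions):
    """Advance one major clock cycle.

    Every processor computes its update from the values present at the
    start of the cycle, so a result only becomes visible to the other
    processors on the next cycle.
    """
    m = len(grid)
    # Boundary points (index 0 and m - 1) are never updated.
    new_values = {
        p: 0.5 * (grid[p - 1] + grid[p + 1])
        for p in positions
        if 0 < p < m - 1
    }
    for p, value in new_values.items():
        grid[p] = value
    # Shifting the store by one grid point is modeled by moving every
    # processor one point forward instead.
    return [(p + 1) % m for p in positions]


def run(grid, n_processors, n_cycles):
    """Run n_cycles clock cycles with equally spaced processors (assumed)."""
    m = len(grid)
    grid = list(grid)  # work on a copy
    positions = [(j * m) // n_processors for j in range(n_processors)]
    for _ in range(n_cycles):
        positions = clock_cycle(grid, positions)
    return grid


start = [0.0, 3.0, 1.0, 4.0, 1.0, 1.0]  # arbitrary illustrative values
print(run(start, 1, len(start)))  # one processor, one full revolution
print(run(start, len(start), 1))  # one processor per point, one cycle
```

With one processor, a full revolution reproduces a Gauss-Seidel sweep; with one processor per grid point, a single clock cycle reproduces a Jacobi sweep. Intermediate processor counts mix old and new values.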

3.  Method  of  Performance  Evaluation 

When all grid points have passed through all
processors once, we say that one iteration has taken place.
This corresponds to each bit in the stream being shifted
all the way around the circle. The time required for
this 360° shift depends on the major clock cycle time
and the number of grid points. (We assume that a
processor completes its function during one major clock
cycle.) Since the processors operate in parallel, the
time does not depend on the number of processors; an
iteration represents an amount of time that is
independent of the number of processors. Neglecting such
things as time for loading boundary values, the number
of iterations required for convergence is a reasonable
measure of real time required for convergence, and can
be used to compare the performance of different
systems. However, for systems with different grid sizes or
representing different equations, an iteration may
mean different things and thus cannot serve as a basis
for comparison. Note also that there is no reason that
the number of iterations required for convergence need
be an integer. (Recall that we are using a global counter
to detect convergence.)

Since the number of iterations required for
convergence is proportional to the time required for
convergence, a reasonable idea of the performance of a
system as a function of number of processors can be
obtained by investigating the relationship between the
number of iterations required and the number of
processors. If we let its(n) be the number of iterations
required with n processors, we can define

$$ e(n) = \frac{\mathrm{its}(1)/n}{\mathrm{its}(n)} $$

to be the efficiency with n processors, the efficiency
with one processor being 100%. In general the efficiency
will be less than 100%, but it is not impossible for it to
exceed 100% (e.g. two processors can be more than
twice as fast as one).
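
As a quick illustration of the efficiency measure, the sketch below evaluates it for made-up iteration counts. The numbers are hypothetical and are not taken from the simulation results; the function name `efficiency` is likewise only illustrative.

```python
def efficiency(its_1, its_n, n):
    """Efficiency e(n) = (its(1) / n) / its(n)."""
    return (its_1 / n) / its_n


# Hypothetical iteration counts, chosen only to illustrate the formula.
print(efficiency(400.0, 400.0, 1))  # one processor: 1.0 by definition
print(efficiency(400.0, 100.0, 8))  # 8 processors but only 4x faster: 0.5
```

An efficiency of 0.5 with 8 processors corresponds to converging 4 times as fast as a single processor.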

4.  Test  Problem:  Laplace's  Equation 

A computer simulation of the scheme described
above has been carried out using Laplace's equation
($\nabla^2 f = 0$) on a line, in a square, and in a cube, with
Dirichlet boundary conditions. Explicit iterative methods
for Laplace's equation are widely used, and their
convergence characteristics are well known. (See [8], for
example.) The one used in the simulation is the
simplest: Each grid point is replaced by the average of the
points immediately adjacent (not diagonal) to it. Thus,
for a k-dimensional problem, a point is replaced by the
average of the 2k points adjacent to it on the
rectilinear lattice of grid points in k-space.

As suggested above, the grid is mapped to a serial
stream by using a raster scan: the end of one line is
connected to the beginning of the next. Some
experiments indicated that using other scanning patterns,
such as boustrophedon (back and forth, as the ox
plows) has little if any effect on the results.

The convergence criterion used is based on the
maximum relative change in function value at the grid
points. If the old and new values at grid point k are
respectively $g^{i}(k)$ and $g^{i+1}(k)$, then we say we have
converged at point k if at the most recent update at
point k we have

$$ |g^{i+1}(k) - g^{i}(k)| < \epsilon \, g^{i}(k) $$

where $\epsilon$ is the convergence criterion. If at some
moment we have converged at all grid points, we say
the computation itself has converged.
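
The relative-change test can be written as a short predicate. This is a sketch only: the function names are illustrative, and taking the absolute value of the old value on the right-hand side is an added assumption so that the test also behaves sensibly for negative values.

```python
def point_converged(old, new, eps):
    """Relative-change test at one grid point.

    Using abs(old) on the right-hand side is an assumption; for the
    non-negative values of the test problems it makes no difference.
    """
    return abs(new - old) < eps * abs(old)


def all_converged(old_values, new_values, eps):
    """The computation has converged when every grid point has."""
    return all(
        point_converged(old, new, eps)
        for old, new in zip(old_values, new_values)
    )


print(point_converged(1.0, 1.001, 0.002))  # small relative change
print(point_converged(1.0, 1.01, 0.002))   # change too large
```

Note that with a strict inequality, a point whose old value is exactly zero is never reported as converged by this test.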

Problems with a variety of different dimensions,
grid sizes, boundary values and tolerances were
simulated and we next present some numerical results.

5. Experimental Results

Figures 3-5 show plots of efficiency e(n) vs. n for
three typical problems, of one, two, and three
dimensions. In all three cases the convergence
tolerance is $\epsilon = 0.002$, and non-boundary grid points have
the initial value 0.

The one-dimensional grid has 200 points, including
boundary points. One boundary value is 0.0 and the
other 1.0. The two-dimensional grid is 20x20 points, with
the square boundary having the value 1.0. The 10x10x10
point three-dimensional case also has its boundary
values equal to 1.0 everywhere.

As expected because of the gradual transition
between the extreme cases of the Gauss-Seidel and
Jacobi methods, there is a general downward trend in
efficiency. Furthermore, the efficiency decreases from
100% to about 50%, as would be expected from the fact
that the Jacobi method for this problem is theoretically
asymptotically one-half as fast as the Gauss-Seidel [8].
An efficiency of 50% with n processors means that we
are converging n/2 instead of n times as fast as with
one processor.

The different dimensions give rise to different
curve shapes, but those shapes did not vary much as
convergence tolerance, grid size, and boundary
conditions were varied. In the one-dimensional case,
efficiency is near 100% as long as the number of
processors is less than half the number of non-boundary grid
points, but at that point, efficiency falls off sharply to
about 50% with 200 processors. In two dimensions the
efficiency curve has two fairly distinct levels, with the
break point again coming at approximately half the
number of non-boundary grid points. The efficiency plot
for three dimensions seems not to have two distinct
regions, but falls off gradually. In all cases there is a
great deal of local jumping up and down, due evidently
to the particular way in which the processors use the
information of neighboring processors in particular
arrangements.

The maximum absolute speed is obtained by having
N processors, where N is the number of non-boundary
grid points, but there are diminishing returns for using
more than about N/2 processors. In any particular
application the choice of number of processors will
depend on the cost of a single processor relative to the
cost of the whole system. The efficiency with N
processors remains above 40%, sometimes even getting as high
as 65%.

6. Over-relaxation

A preliminary test was made of a simple
over-relaxation strategy in the one-dimensional case. Here
the new value at each grid point is defined by

$$ g^{(i+1)} = g^{(i)} + \alpha \, (p^{(i+1)} - g^{(i)}) $$

where $p^{(i+1)}$ is the value that would be adopted at this
step if over-relaxation were not being used, and $\alpha \ge 1$ is
the over-relaxation parameter. When $\alpha$ is taken to be
1.5 in the one-dimensional case described in Fig. 3, the
number of iterations required by one processor is
reduced from 397,564 to 212,057, an increase in
absolute speed of 97%. Figure 6 shows a plot of the efficiency
vs. n for $\alpha = 1.5$. It is of the same general character up
to N/2, showing efficiency near 100% in this range, but
past that point the iteration rapidly becomes unstable
(with the efficiency therefore going to zero). Further
work is needed to explain and predict the stability of
the over-relaxation method for the parallel computation
scheme discussed here.
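
A minimal sketch of the over-relaxed update follows. It assumes the usual form of over-relaxation, in which the new value is the old value plus the over-relaxation parameter times the correction that the plain method would have applied; the function and argument names are illustrative only.

```python
def over_relaxed(old, plain, alpha):
    """Over-relaxed update (assumed standard form).

    old   : current value at the grid point
    plain : value the plain point iteration would adopt at this step
    alpha : over-relaxation parameter, alpha >= 1
    """
    return old + alpha * (plain - old)


print(over_relaxed(1.0, 2.0, 1.0))  # alpha = 1 recovers the plain update
print(over_relaxed(1.0, 2.0, 1.5))  # alpha = 1.5 overshoots the plain value
```

With the parameter equal to 1 the update reduces to the plain method, which is a useful sanity check on any implementation.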

7.  Conclusions 

The simulation results for the circulating store
method and the standard point iteration method are in
accord with theory, and they are encouraging: n
processors never operate slower than about n/2 times as
fast as one. Furthermore, for up to N/2 processors,
where N is the number of non-boundary grid points, the
efficiency remains above 90% in the one-dimensional
case, and above 75% in the two-dimensional case. There
are sharply diminishing returns for using more than
N/2 processors.

The approach is applicable to linear and nonlinear
problems of any dimension with any boundary
conditions, makes efficient use of large numbers of identical
processors, and has a very simple, linear,
interconnection pattern. More work is needed to determine the
stability and convergence rates of the over-relaxation
method,  and  more  sophisticated  and  potentially  faster 
methods,  in  higher  dimensions,  for  more  ambitious 
problems.

Abstract-In many computational tasks, especially in signal processing,
it is the throughput that is important, rather than the latency, or delay.
If a special-purpose VLSI chip is designed for a particular signal
processing task, such as FIR filtering, for example, the maximum clock rate,
and hence throughput, is determined by the depth of the combinational
logic between registers and the time required for the distribution and
operation of the clock. If the combinational logic is sufficiently deep
(in bit-parallel circuits, for example), the throughput can be increased
by inserting intermediate stages of clocked latches. This is at the
expense of increased area and delay to operate and clock the intermediate
registers. Roughly speaking, the strategy amounts to using more of the
chip area to store information useful for pipelining.

This paper investigates the optimal tradeoff between the degree of
intermediate latching and cost, using the measure AP, where A is the
chip area and P is the period (the reciprocal of throughput). We derive
expressions for the time and area before and after intermediate
latching, using the Mead-Conway model, both for the cases of on-chip and
off-chip clock drivers. The results show that significant reductions in
AP product (reciprocal of throughput per unit area) can be achieved
by intermediate latching in many typical signal processing applications,
for a wide range of circuit parameters. The array multiplier is used as
an example.

I.  Introduction 

WHEN certain tasks are implemented with special-purpose
VLSI chips, it is often the period P (time between
successive outputs) that is crucial, rather than the latency or delay
T. This is especially true in signal processing, where typical
tasks such as filtering and discrete Fourier transformation
often have high volume requirements and relatively lax delay
requirements. Recent work has described bit-serial and
bit-parallel VLSI architectures that do in fact allow the period to
be equal to the clock period (see, for example, [2], [4]-[9],
[12]). In [5], [7] a class of these circuits is called completely
pipelined. In this paper, we take up a different question, that
of inserting intermediate stages of latching so as to maximize
the rate at which the clock can run without a disproportionate
blowup in area requirements. We will use the criterion of
minimizing the AP product, where A is the area of the VLSI circuit
and P is the period. The AP product can be thought of as the
reciprocal of throughput per unit area, and a completely
pipelined circuit optimal with respect to this criterion can be
claimed to make best use of chip area. Leiserson and Saxe
[14] treat the related problem of redistributing latches so as
to decrease period, but they do not consider area or clocking
penalties.

Fig. 1. Two-phase clocked latches between stages of combinational
logic.

We assume that the circuits we discuss are designed along the
lines described by Mead and Conway [1]: typically that a
two-phase clock is used to transfer information between registers
(or latches), and that these registers are separated by
combinational logic. The following sections are devoted to modeling
the time and area requirements of the latches, the
combinational logic, and the clock driver. We then consider the overall
circuit and investigate the optimal choice of the amount of
latching for the two cases of on-chip and off-chip clock drivers.
While the assumptions made about first-order circuit behavior
pertain to nMOS technology, the analysis technique uses
dimensionless parameterization and is applicable to any
situations with deep combinational logic, typically bit-parallel
circuits. A representative tradeoff curve is shown for an example.

II.  Clock  Timing 

We will adopt a version of the two-phase clocking system
described by Seitz in [1, ch. 7], a typical stage of which is
shown in Fig. 1. Fig. 2 shows the corresponding timing
diagram. First, we must drive the phase 1 clock signal $\phi_1$ high,
taking time $t_{clock}$ (the clock driver time). We then need a
minimum time $t_{delay}$ (the delay time) to charge the input stage
of the combinational logic. Phase 1 must then go low (taking
time $t_{clock}$), and phase 2 must then go high (also taking time
$t_{clock}$). We must insure that there is a minimum time $t_{12}$
during which both clocks are low; otherwise we run the risk that
skew between the clock phases will cause both clocks to be
on at the same time. This brings us up to the point where the
combinational logic has already started to work.

Fig. 2. Clock-timing diagram.

The input values propagate through the combinational logic,
taking some time $t_{logic}$. This time includes the time during
which $\phi_1$ is brought down and $\phi_2$ is brought up. The time
$t_{logic}$ will ordinarily dominate the clock-interchange time, but,
in general, we need to set the time for this operation to

$$ t = \max(t_{logic},\ 2t_{clock} + t_{12}) $$

where, for safe operation of the circuit, $t_{logic}$ must of course
be taken as the maximum delay time of the combinational
logic.

We next need to transfer the output values of the preceding
logic stage to the input of the latch whose output is controlled
by $\phi_1$; that is, $\phi_2$ must remain on for a minimum charging
time $t_{set}$ (the preset time). The $\phi_2$ clock signal must then be
brought down (taking another clock driver time $t_{clock}$), and
another dead time ($t_{21}$) provided to insure nonoverlap of
clocks in case of clock skew.

The minimum period P of the circuit is therefore

$$ P = 2t_{clock} + t_{delay} + t_{set} + t_{21} + \max(t_{logic},\ 2t_{clock} + t_{12}). $$

To be more accurate, we might want to take into account the
fact that the upgoing and downgoing clock waveforms are
not completely symmetric; but the term $t_{clock}$ can be taken
to represent the average of the upgoing and downgoing clock
times in a single driver. In a multistage driver the stages
alternate up and down, and we can take $t_{clock}$ to be the sum of
the averages of the upgoing and downgoing times along the
driving chain.
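
The minimum period derived in this section can be evaluated directly. The sketch below is illustrative only: the function name is made up, and the numerical arguments are arbitrary values in units of the transit time, not figures from the paper.

```python
def min_period(t_clock, t_delay, t_set, t_21, t_logic, t_12):
    """Minimum clock period of one two-phase stage.

    The logic time overlaps the clock interchange (phase 1 down, phase 2 up,
    plus the dead time t_12), so only the larger of the two is counted.
    """
    overlap = max(t_logic, 2 * t_clock + t_12)
    return 2 * t_clock + t_delay + t_set + t_21 + overlap


# Arbitrary illustrative values.
print(min_period(1, 10, 15, 2, 48, 2))  # deep logic: logic time dominates
print(min_period(1, 10, 15, 2, 3, 2))   # shallow logic: clocking dominates
```

Once the logic time falls below the clock-interchange time, further reductions in logic depth no longer shorten the period.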

III.  Latch  Time  and  Space 

We next want to express the time delay of the latches in
terms of basic units that are determined by the technology.
For this purpose, we consider the nMOS inverter with a
minimum size pulldown and a pullup/pulldown ratio of 4 to be
the basic cell, with area A, pulldown gate capacitance C,
effective pulldown resistance R, and pulldown time (transit
time) $\tau$ when driving the input of an equal size inverter. We
refer to such a cell in what follows as a minimal inverter.

Fig. 3. Details of the clocked latches, showing pullup and pulldown
effective resistances and capacitances.

Now inverters in the latches are driven through pass
transistors, so the discussion in [1] shows that we should choose a
pullup/pulldown ratio of 8. The time required for the second
inverter to charge its load is therefore approximated by the
following RC constant:

$$ t_{delay} \approx (R_1 + R_{pass})(C_{load} + C_{pass}) $$

where the R's and C's are shown in Fig. 3. Assuming that
the pass transistors are minimum size, $R_{pass} = R$ and $C_{pass} = C$.
Also assuming that the capacitive load (input to the
combinational logic) is minimal, we get

$$ t_{delay} = 2(R_1/R + 1)\tau = 2(L_1/W_1 + 1)\tau $$

where, from now on, we express resistance in terms of the
length-to-width ratio of the transistor.

If the pullup/pulldown ratio of the latches is taken to be 8
(as mentioned above), we can write the normalized delay
time as

$$ t_{delay}/\tau = 2(8r + 1) $$

where $r = L_2/W_2$ is the size of the latch pulldown. When r = 1/2
the pulldown transistor of the latch inverter will be twice as
wide as the corresponding transistor of the minimal inverter,
but the pullup/pulldown ratio is 8, not 4, so the pullup
transistor will then be the same length as in the minimal inverter.
The area of such a latch inverter with r = 1/2 will be only a little
larger than that of a minimal inverter, perhaps about 25
percent larger. Thus, the choice of r = 1/2 speeds up the latch
without much area penalty, and we will use this value in this paper,
although it could be kept as a parameter.

Using a similar argument based on RC charging times, the
preset time is

$$ t_{set}/\tau = (8r + 1)(1/r + 1). $$

The 1/r term comes from the input capacitance of the second
inverter, which loads the first inverter. To see this, write

$$ C_{load} = (L_2 W_2 / L W)\, C = (W_2 / L_2)\, C = (1/r)\, C $$

where $L_2 = L = W$ are minimum size.

The latching area is easy to write down. Assuming that the
pass transistors are the same size as minimal inverters, and that
the latches have area 1.25A, each two-phase latch requires
normalized area

$$ A_{latch}/A = 2(1.25 + 1) = 4.5. $$
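
The latch formulas of this section are easy to check numerically. In the sketch below the function names are illustrative; the formulas are the normalized delay time, preset time, and latch area given above, evaluated at the chosen pulldown size r = 1/2.

```python
def latch_delay(r):
    """Normalized delay time t_delay / tau for latch pulldown size r."""
    return 2 * (8 * r + 1)


def latch_preset(r):
    """Normalized preset time t_set / tau for latch pulldown size r."""
    return (8 * r + 1) * (1 / r + 1)


def latch_area(inverter_area=1.25, pass_area=1.0):
    """Normalized area of one two-phase latch (two inverter/pass pairs)."""
    return 2 * (inverter_area + pass_area)


r = 0.5
print(latch_delay(r), latch_preset(r), latch_delay(r) + latch_preset(r))
print(latch_area())
```

For r = 1/2 the delay and preset times are 10 and 15 transit times, and their sum, 25, is the constant that appears in the normalized period used in Sections VI and VII.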

IV. Combinational Logic Time and Space

We want a fairly general model for the combinational logic
that is sandwiched between the latches; such logic may be
built from NAND and NOR gates, pass transistors, or some
combination of the two. We will assume that the typical logic
stage is a uniform array of n × k logical elements, each of
which has an area $A_{elem}$ and a delay $t_{elem}$, where

$$ A_{elem} = \alpha A $$

and

$$ t_{elem} = \beta \tau. $$

This array will be thought of as n rows by k columns, with a
maximum delay path from left to right of k elements. Since
logic stages are not usually so uniform, the $\alpha$ and $\beta$ parameters
must represent average values for the combinational logic. If
gates are built out of inverters and coupled directly, for
example, $\beta$ will generally be determined by the fan-out factor of
the logic and the size of the inverters. An average fan-out
factor of 3, using gates (with a pullup/pulldown ratio of 4), will
result in $\beta = 12$, because we must allow for the worst case in
the propagation of logic, where all signals are upgoing. To
reduce this to a value closer to that of a minimal inverter, we
expect to increase the area to, say, twice that of a minimal
inverter. Thus, we can take values of $\alpha = 2$ and $\beta = 4$-$12$ as
typical of combinational logic implemented with arrays of
gates. We should also note that the value of $\alpha$ should be
selected to reflect the space per logical element required for
power and ground lines.

We will assume that the nominal circuit has one typical logic
stage between a pair of two-phase latches, and we then
consider the insertion of (m - 1) latches equally spaced in the
combinational logic, $m \ge 1$. The case m = 1 then represents
the original situation. We assume the latches can be made to
"fit" well; that is, that the combinational logic is arranged
regularly enough so that stages can be pushed apart and
columns of latches inserted. The total time required for the logic
is therefore

$$ t_{logic}/\tau = \beta (k/m) $$

and the area

$$ A_{logic}/A = \alpha d k^2 $$

where d = n/k is the height-to-width ratio of the original logic
block, another dimensionless parameter, usually assumed to
be 1.
V.  On-Chip  Clock  Driver  Time  and  Space 

If we use an on-chip clock driver, we want to use a
multistage version as described in [1], since the driver will have a
large capacitive load, especially if there is an appreciable
amount of intermediate latching introduced. We assume that
clock distribution is on metal, so that propagation delay along
the wires is small. Each stage is assumed to have a pulldown f
times the size of the preceding, so if there are S stages driving
Y pass transistors, each with minimal capacitance C,

$$ f = Y^{1/S}. $$

If we start the clock driving with a minimal inverter, the
normalized delay of such a driver is approximately

$$ t_{drive}/\tau \approx 2.5 f S. $$

The factor of 2.5 results from averaging the pullup time of $4\tau$
and pulldown time $\tau$ along the inverter chain. (If we do not
insist that S is an integer, and we minimize this delay with
respect to f, we get the value f = e [1]. But S is an integer.)
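
The choice of the number of driver stages can be illustrated by a small search over integer stage counts. This is a sketch only: the function names and the load of 1000 pass transistors are hypothetical, and the upper limit of 16 stages is an arbitrary search bound.

```python
import math


def driver_delay(y, s):
    """Normalized delay 2.5 * f * s of an s-stage driver for y loads."""
    f = y ** (1.0 / s)  # size ratio between successive stages
    return 2.5 * f * s


def best_stage_count(y, max_stages=16):
    """Integer number of stages that minimizes the driver delay."""
    return min(range(1, max_stages + 1), key=lambda s: driver_delay(y, s))


y = 1000  # hypothetical number of pass transistors driven
s = best_stage_count(y)
print(s, y ** (1.0 / s), math.log(y))
```

For this load the best integer stage count is close to the natural logarithm of the load, and the resulting stage ratio is close to e, as the continuous analysis suggests.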

This estimate for delay assumes that we insist on a
globally-synchronized clock; that is, that the clock signals at the input of the
driver can be used anywhere else without concern for
synchronization. Caraiscos and Liu [11] have pointed out that the rise
and fall times of the clock waveforms may be much smaller
than the absolute delay, and that using a local clock may allow
higher throughput, at the expense of using local clock signals
that must be made synchronous with the signal itself at
different points on and off the chip. Sending the clock along with
the signal will incur other costs, of course. (For a discussion of
the virtues of a globally-synchronized clock in signal
processing, see [10].) The analysis in this paper is conservative in
the sense that the resulting degree of latching and increase in
throughput is on the low side. (We can avoid the area and
delay penalty incurred by using an on-chip driver by moving
the clock driver off-chip. That case will be discussed in more
detail in Section VII.)

We must also consider the area contribution of the clock
driver in relation to the rest of the circuit. The normalized
area of the driver is

$$ A_{drive}/A = (Y - 1)/(f - 1). $$


Next we look at the overall time and space requirements of
the circuit.

VI. Optimization of AP Product with an On-Chip
Clock Driver

We can now write the total minimum normalized period
$P/\tau = p$ in terms of our parameters as follows:

$$ p = 5fS + 25 + \tau_{21} + \max(\beta k/m,\ 5fS + \tau_{12}) $$

where, as above,

$$ f^S = Y = (m + 1)n = \text{number of lines driven} $$

and $\tau_{12} = t_{12}/\tau$, $\tau_{21} = t_{21}/\tau$. Similarly, the total normalized
area $A_{total}/A = a$ is

$$ a = 2(Y - 1)/(f - 1) + 4.5Y + \alpha k n $$

where the factor of 2 accounts for the fact that we must have
two drivers, one for each phase. (These can be combined to
some extent, but the total area is still nearly twice that of a
single driver.)

We now have the function $ap(m, S)$, where m and S are
discrete parameters. The number of stages is never much larger
than ln Y, since the optimal choice of f is usually around e.
In most cases of interest, therefore, it suffices to take the
minimum of ap for S = 1, ..., 16, producing what we call
$ap(m, *)$:

$$ ap(m, *) = \min_S ap(m, S). $$

The range of m is certainly between 1 and k, so the optimal
choice of m can be determined simply by

$$ ap(*, *) = \min_m ap(m, *). $$

The gain G in AP product achieved by latching is, therefore,

$$ G = ap(1, *)/ap(*, *). $$

VII. The Case of an Off-Chip Clock Driver

As mentioned in Section V, if we allow the clock driver to
be off-chip, we can drive the larger capacitive loads incurred
by extra latching with essentially no penalty in clock delay or
driver area. The normalized period and area can then be
written

$$ p = 2\tau_{clock} + 25 + \tau_{21} + \max(\beta k/m,\ 2\tau_{clock} + \tau_{12}) $$

$$ a = 4.5Y + \alpha k n $$

where we have assumed some delay of $\tau_{clock} = t_{clock}/\tau$ for the
clock rise and fall times. The ap product is therefore a
function of only one unknown parameter, m.

With these changes in a and p, the same methodology applies;
a numerical example will be given in the next section. Note,
however, that now the optimal value of m will occur roughly
near the breakpoint where $\beta k/m = 2\tau_{clock} + \tau_{12}$, and that these
times are both highly uncertain and small in size. The analysis
in this case is therefore much less reliable, and much more
sensitive to unmodeled effects such as propagation delay, than in
the on-chip clock driver case.


VIII.  Numerical  Examples 

We now give some typical numerical results. For this purpose,
we consider a 16-bit array multiplier, implemented by an
array of full adders, as described, for example, in [2]. We also
assume that the full adders are implemented with gates: each
full adder will then be about 3 gates deep. The carry propagation
will require an array that has a maximum depth of 2 × 16,
so altogether the combinational logic will have $k \approx 100$. (This
is consistent with the value of “113 gate delays” given in [3].)
Say that each gate takes about double the area of a minimal
inverter ($\alpha \approx 2$, optimistic for area, and hence pessimistic for
our purposes), and that, as discussed in Section IV, $\beta \approx 6$. The
array is roughly square, so that $d \approx 1$. Finally, we will assume
that clock skew is not an important problem, and take
$\tau_{12} = \tau_{21} = 4$.

Fig. 4 shows a plot of normalized period p(m, *)/p(1, *);
normalized area a(m, *)/a(1, *); and normalized AP product
ap(m, *)/ap(1, *) versus m. The period as a function of m
decreases sharply (roughly as 1/m) until the combinational
logic time is dominated by the clock-swapping and dead time
(that is, until $\tau_{\text{logic}} \approx 2\tau_{\text{clock}} + \tau_{12}$). After this point the
clock-driving time will determine the minimum clock period and it
no longer pays to increase m, because the area will increase
with no payoff in speed. The minimum value of period occurs
close to the minimum value of AP product. Thus, in theory,
the period can be decreased somewhat from its value when
the AP product is minimized, at a slight cost in area. In practice
the optimal values are almost always nearly equal, and
sometimes identical, because of the discreteness of the parameters
m and S.

Fig. 5 shows a plot of gain G in AP product versus the depth
of combinational logic k, for the values $\alpha = 2$ and $\beta = 4, 6, 8, 12$.
The graph shows significant gains in AP product (more
than 2) over the unlatched case when $k > 50$ and $\beta > 6$. Even
when the gates are as fast as a minimal inverter (worst-case
delay factor $\beta = 4$), there is an AP product gain of 2.2 when
k = 100. Note that a larger value of $\alpha$ would only improve the
gain.

We conclude by looking at the actual numerical values of
the minimum clock periods and areas involved in this analysis.
Taking the $k = 100$, $\alpha = 2$, $\beta = 6$ case above for a hypothetical
16-bit array multiplier, and assuming $\tau = 0.3$ ns for current
technology, we get a period of P = 210 ns with no intermediate
latching, and an optimal period of P = 66 ns with m = 6 (5
intermediate latching stages).

The area before latching is $2.11 \times 10^4$ minimal-inverter areas, which
at $\lambda = 1.5\,\mu$m (3 $\mu$m line width) and a $225\lambda^2$ inverter is about
10.7 mm². After the intermediate latching, the area becomes 12.1 mm²;
certainly a modest increase in area for about a threefold increase
in speed.
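
The enumeration over m and S that produces these figures can be sketched in a few lines. This is a minimal sketch in Python, not the program used for the paper. Two quantities are assumptions on our part: the number of lines per latching stage (`N = 100`), which the text does not state, and the clock-driver delay term, which is read here as 5fS. Both were chosen because they appear to reproduce the period and area values quoted above.

```python
TAU_NS = 0.3                  # minimal inverter delay tau in ns (from the text)
ALPHA, BETA, K = 2, 6, 100    # gate area, gate delay, logic depth (from the text)
T12 = T21 = 4                 # normalized clock dead times (from the text)
N = 100                       # ASSUMPTION: lines per latching stage, not given in the text


def area_period(m, s):
    """Normalized (area, period) for m logic segments and s clock driver stages."""
    y = (m + 1) * N                       # number of lines driven, Y = (m + 1) n
    f = y ** (1.0 / s)                    # stage ratio chosen so that f**s == Y
    drive = 5 * f * s                     # ASSUMPTION: driver delay term read as 5fS
    p = drive + 25 + T21 + max(BETA * K / m, drive + T12)
    a = 2 * (y - 1) / (f - 1) + 4.5 * y + ALPHA * K * N
    return a, p


def best_for(m):
    """Minimize the AP product over S = 1..16 for fixed m; returns (ap, a, p, S)."""
    candidates = []
    for s in range(1, 17):
        a, p = area_period(m, s)
        candidates.append((a * p, a, p, s))
    return min(candidates)


def optimize():
    """Minimize over m = 1..k by enumeration; returns (m, (ap, a, p, S))."""
    return min(((m, best_for(m)) for m in range(1, K + 1)), key=lambda t: t[1][0])


unlatched = best_for(1)
m_opt, latched = optimize()
gain = unlatched[0] / latched[0]
print(m_opt, latched[3], round(latched[2] * TAU_NS, 1), round(gain, 2))
```

Under these assumptions the enumeration selects m = 6, with a period of roughly 66 ns against roughly 210 ns for the unlatched circuit. Because both loops are short (at most k values of m and 16 values of S), exhaustive enumeration is adequate and no continuous optimization is needed.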
Fig. 4.  Normalized period, area, and AP product versus m for $\alpha = 2$,
$\beta = 6$, $k = 100$. The parameter (m - 1) is the number of intermediate
latching stages.

Fig. 5.  Gain in AP product versus combinational logic depth k for
$\beta = 4, 6, 8, 12$. The parameter $\beta$ is the delay of a combinational logic
element, normalized in terms of that of a minimal inverter.

The preceding example assumed an on-chip clock driver.
When we use an off-chip clock driver at presumed small cost,
as discussed in Section VII, we naturally get much faster solutions.
In this example, the optimal value of period with the
parameters of Section VII and $\tau_{\text{clock}} = 4$ (assuming a very
sharp clock rise time and fall time), minimizing AP product,
is 18 ns, compared with the unlatched value of 191 ns. The
area goes from 10.6 mm² with no latching to 16.5 mm² with
latching. This large increase in area reflects a corresponding
increase in the density of latching: 26 (m = 27) latching stages
are introduced. We emphasize that in the case of an off-chip
clock driver, the numerical values of the parameters $\tau_{12}$ and
$\tau_{\text{clock}}$ are very uncertain and the optimal values of period, area,
and latching density are sensitive to these parameters. The
large predicted speedups in possible clock rate may not be
realizable in practice.
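
With an off-chip driver only m remains to be chosen, so the search reduces to a single loop. The sketch below is illustrative Python under the same assumption as before that each latching stage has `N = 100` lines, a value not stated in the text; all other constants are taken from this section.

```python
TAU_NS = 0.3                  # minimal inverter delay tau in ns (from the text)
ALPHA, BETA, K = 2, 6, 100    # gate area, gate delay, logic depth (from the text)
T12 = T21 = 4                 # normalized clock dead times (from the text)
T_CLOCK = 4                   # normalized clock rise/fall delay (from the text)
N = 100                       # ASSUMPTION: lines per latching stage, not given in the text


def area_period_offchip(m):
    """Normalized (area, period) with an off-chip clock driver."""
    y = (m + 1) * N
    p = 2 * T_CLOCK + 25 + T21 + max(BETA * K / m, 2 * T_CLOCK + T12)
    a = 4.5 * y + ALPHA * K * N
    return a, p


def ap_product(m):
    a, p = area_period_offchip(m)
    return a * p


m_off = min(range(1, K + 1), key=ap_product)          # enumerate m = 1..k
period_unlatched_ns = area_period_offchip(1)[1] * TAU_NS
period_latched_ns = area_period_offchip(m_off)[1] * TAU_NS
print(m_off, round(period_unlatched_ns), round(period_latched_ns))
```

The AP product is very flat near its minimum (neighbouring values of m differ by well under one percent), which is one concrete way to see why the optimum is sensitive to small changes in the clock parameters.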

IX.  Conclusions 

We have modeled the timing of a generic pipelinable VLSI circuit
in which there are combinational logic stages separated by
latching stages driven by two-phase clocks. An array multiplier
is typical of such a configuration. We then investigated the
effect of introducing intermediate latching stages, especially
the tradeoff between increased throughput and increased area.
Expressions were derived for area and minimum clock period,
normalized in terms of minimal inverter area and delay, and
we showed that optimal choices of the number of clock driver
stages (S), and the number of intermediate latching stages
(m - 1), can be made by simple enumeration.

The numerical results illustrate the choice of latching density
in a typical signal processing application. According to our
model, a 16-bit array multiplier with gate logic and an on-chip
multistage clock driver can be clocked about three times faster
with about a 13 percent increase in area using five intermediate
latching stages. This decrease in period is also accompanied by
an increase in the latency, or delay, of the multiplier.

Higher  throughput  can  be  achieved  with  an  off-chip  clock 
driver,  but  the  parameters  in  that  case  are  less  well  known, 
and  at  such  speeds  the  model  becomes  less  reliable. 

Much more work needs to be done on detailed modeling of
the timing of such VLSI circuits if we are to achieve maximum
throughput rates in applications like signal processing. Future
work will attempt to refine our model, along the lines of [13]
as an example. We also need to study propagation delay, which
was assumed to be relatively small in the examples (4 times
the minimal inverter gate delay $\tau$ for clock distribution, a reasonable
assumption if the clock lines are metal, for example).
Another important set of interesting problems concerns the
study of the way algorithms, topologies, and layouts interact
with the timing problems considered here. Recent work on
completely pipelined or bit-level systolic arrays is a start in
that direction (see, for example, [2], [4]-[9], [12]).
1.  Overview

This  is  a  short  collection  of  notes  on  the  latest  results  from  the  Massive 
Memory  Machine  (M3)  group.  Most  of  the  notes  concern  the  magnitude  of  the 
speedup  possible  with  massive  amounts  of  physical  memory.  We  are  greatly 
encouraged  by  the  results  recently  obtained,  and  are  of  course  eager  to  see  a  real 
M3  in  operation  soon. 

In  addition,  there  is  growing  industrial  interest  and  support  for  the  M3  con¬ 
cept.  As  we  reported  earlier,  DEC  is  very  enthusiastic  about  working  with  us  on 
a  large  memory  VAX,  as  well  as  on  the  ESP  architecture.  Furthermore,  our 
friends at DEC have just told us that DEC will soon be announcing actual product
VAX's with 128Mb of memory. While this is not the 256-512Mb we are
planning, it is exciting to see that they are thinking along similar lines.

There are also two new groups that are interested in M3. The first is a
group at Bell Labs at Murray Hill. They would like to build an M3 to solve certain
phone company transaction problems that require very high transaction rates.
These could easily be accomplished on a 1 MIPS M3, but would require a huge
number of parallel processors if the data were stored on rotating disks. They are
working with us on plans for a small prototype.

The  second  group  is  at  IBM  Yorktown.  They  are  quite  interested  in  the 
whole  M3  concept;  they  found  out  about  M3  by  reading  our  recent  IEEE  publica¬ 
tion.  We  have  just  met  with  Dr.  Frank  Moss,  the  project  leader,  and  we  are 
planning  a  joint  two  day  meeting  in  a  few  weeks  in  Princeton.  We  hope  that 
through  such  meetings  we  can  work  out  a  strategy  for  formal  cooperation. 

1.1.  PROLOG  Studies 

PROLOG  is  widely  touted  as  the  language  of  choice  for  expert  systems 
research  and  development.  Consequently,  we  view  PROLOG  as  a  solid  basis  for 
experiments  in  applications  of  massive  memory.  The  early  returns  are  most 
encouraging:  we  are  discovering  general  techniques  for  speeding  up  PROLOG 
programs,  as  well  as  a  number  of  tricks  that  we  can  apply  in  special  cir¬ 
cumstances. 

1.2.  Program  Tracing 

In order to better understand the data reference patterns of memory intensive
programs,  we  have  implemented  a  software  tracing  package.  It  is  being  used  to 
study  several  programs,  including  the  Clay  solver,  and  to  predict  their  running 
time  on  machines  of  various  memory  sizes  and  architectures.  The  trace  package 
uses  the  UNIX  debugging  facility  to  interrupt  the  program  under  analysis  after 
each  instruction  execution.  When  it  is  interrupted,  the  data  location(s)  being 
accessed  are  recorded. 

The main problem with this package is that it slows down program execution
considerably, roughly by a factor of 3000. To alleviate this problem, we are
implementing  a  VAX  simulator  capable  of  producing  the  same  trace  information. 
Preliminary  experiments  with  the  simulator  indicate  that  it  is  10  times  faster 
than  the  original  package.  Although  this  still  represents  substantial  overhead, 
the  new  simulator  will  let  us  study  a  wider  range  of  programs. 


1.3.  ESP  Straw  Man  Prototype 

We have started implementing a preliminary version of an ESP machine. The
goal  is  not  to  create  an  operational  system,  but  to  gain  experience  with  the  ESP 
architecture  and  to  identify  some  of  the  practical  problems  that  may  arise  in  a 
full  implementation. 

We have acquired two 8086-based microprocessor systems. Each CPU talks
to  its  local  memory  via  a  multibus.  Each  CPU  is  also  connected  to  a  pair  of  disk 
drives.  We  have  designed  and  wire-wrapped  two  simple  ESP  controllers;  each 
controller  sits  on  one  of  the  multibuses.  The  controllers  are  tied  together  by  a 
simple  broadcast  bus.  The  controllers  make  no  provisions  for  failures  or  errors. 

We  have  started  debugging  the  ESP  controller  hardware,  and  are  only  begin¬ 
ning  to  design  the  software  that  will  run  the  machines.  Even  at  this  early  stage, 
our  implementation  effort  has  already  turned  up  several  important  issues  that 
had  been  overlooked  in  the  original  paper  design.  These  issues  include  system 
startup,  I/O  and  interrupt  handling,  periodically  refreshing  dynamic  memory, 
and  queueing  data  words  at  each  ESP  controller;  we  are  now  studying  these 
issues.  Some  of  them  can  be  safely  ignored  in  our  prototype  (e.g.,  memory 
refresh);  to  cope  with  others,  we  are  adding  more  capabilities  to  our  ESP  controll¬ 
ers. 

1.4.  M3  Performance  on  Database  Benchmarks 

Two  recent  papers  have  compared  the  performance  of  several  database 
machines,  and  we  decided  to  evaluate  the  performance  of  an  M3  database 
machine  on  the  same  queries  that  were  used  for  the  benchmarks.  Our  prelim¬ 
inary  results  are  given  in  an  attached  report. 

In  summary,  our  results  clearly  show  that  an  M3  that  can  hold  all  of  the 
database  in  fast  memory  can  outperform  the  database  machines  considered.  The 
speedups  range  from  a  factor  of  7,  to  a  factor  of  27,000,  depending  on  the 
assumptions  made  and  the  sample  queries  analyzed.  These  results  must  of  course 
be  treated  with  caution,  but  they  do  illustrate  that  memory  can  be  an  extremely 
useful  resource  for  database  applications. 


2.  Massive  Memory  vs.  Massive  Parallelism 


2.1.  Introduction 

A  common  “folk  principle’’  is  that  massive  parallelism  is  the  only  way  to 
vastly  speed  up  computations.  In  contrast,  we  will  show  that  there  are  important 
classes  of  computations  which  can  be  greatly  sped  up  only  by  massive  amounts  of 
physical  memory.  Thus,  for  these  computations  an  M3  will  vastly  outperform 
any  parallel  machine! 

On the face of it our claim seems absurd. Don’t parallel machines always
dominate sequential machines such as an M3? In order to understand this
apparent paradox let us examine the standard argument more carefully. Assume
that some task has an algorithm A that takes time T(A,n) for inputs of length n.
Then potentially p parallel processors can run this algorithm in time T(A,n)/p.
Of course this is the upper limit on the potential performance of p parallel
processors; in practice fully linear speedup is rare. However, to make our point
about the power of memory over parallel processors even more dramatic, let's
assume that such speedup is always possible.

Since T(A,n) > T(A,n)/p for any p > 1, how can parallel processors ever lose to
a sequential machine such as M3? The answer is that there may be a new algorithm
B for which T(B,n) < T(A,n)/p for any reasonable p. Moreover, this algorithm
may require in an essential way vast amounts of random access memory;
thus, this algorithm cannot be executed on the p parallel processors for lack of
physical memory. In this way it is possible for an M3 to greatly outperform any
collection of parallel processors. Note that we are not saying that memory is always
better than processors; this is false, but then so is the folk principle that parallel
processors are always better than sequential machines. We are simply pointing
out that there are memory intensive computations that benefit much more from
memory than from processors.

A possible counter to our argument is: why can’t the parallel processors have
enough space to use the better algorithm? Of course in principle they can. The
key point is that on many interesting problems we will not be able to afford both
parallel processors and massive amounts of memory. On problems that
fundamentally require memory, not processors, the parallel processors will be
forced to run a slower algorithm, and hence be outperformed by an M3.

We  now  demonstrate  our  claims  with  two  examples  from  PROLOG,  an 
important  language  for  a  wide  variety  of  non-numeric  computations.  There  is 
currently  an  intensive  international  effort  to  use  parallelism  to  speed  up  PRO¬ 
LOG.  It  may  therefore  be  interesting  to  see  how  massive  amounts  of  memory 
can  be  used  to  achieve  vast  speedups  in  PROLOG. 

2.2.  Recursion 

The  first  application  of  memory  is  conveniently  introduced  by  way  of  a  sim¬ 
ple  PROLOG  example: 

path([A|X],[B|Y]) :- edge(A,B).
path([_|X],Y) :- path(X,Y).
path(X,[_|Y]) :- path(X,Y).

(Here edge(,) is some relation that is defined by other rules.) Path(X,Y) checks
the two lists X and Y to see if there is an element in the first list with an edge to
an element in the second list. Intuitively, we would expect that this process
should take time quadratic in n, the total number of elements. However, on any
standard PROLOG it takes exponential time, because PROLOG repeatedly
re-evaluates subgoals. While there are at most quadratically many subgoals, they
are evaluated exponentially many times.

For those unfamiliar with PROLOG’s evaluation scheme, let us examine the
computation of path(X,Y) in more detail. Here X is a list $x_1, \ldots, x_k$ and Y is a
list $y_1, \ldots, y_l$, with k+l=n. Path(X,Y) is computed as follows:

(1)  If either list is empty then path(X,Y) is false.

(2)  Next, if edge($x_1$, $y_1$) is true then path(X,Y) is true.

(3)  Finally, if either path(X',Y) or path(X,Y') is true then so is path(X,Y).
Here X' is equal to $x_2, \ldots, x_k$ and Y' is equal to $y_2, \ldots, y_l$.

Note, the last part of the computation is the key to the use of repeated subgoals.
The call to path(U,V) where U is equal to $x_i, \ldots, x_k$ and V is equal to $y_j, \ldots, y_l$
occurs exactly $\binom{i+j-2}{i-1}$ times.


Let us now compare the performance of a set of p parallel processors on this
example and an M3. The p parallel processors take $2^n/p$ time since the usual
PROLOG implementation checks that many subgoals on this problem. On the
other hand, an M3 can use the following strategy: cache all subgoals and use
table lookup instead of re-evaluating subgoals. This strategy leads to an algorithm
that takes order $n^2$ time, since each subgoal is checked exactly once. Thus,
even for modest sized problems (n equal to 40) the number of parallel processors
required to perform as well as the M3 is on the order of one billion!
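
The effect of caching subgoals can be illustrated with a small counting experiment. The sketch below is written in Python rather than PROLOG and is only a model of the evaluation described above: subgoals are identified by the remaining lengths of the two lists, and we assume the worst case in which edge(,) never succeeds, so every branch is explored. The function name `count_calls` is ours.

```python
from functools import lru_cache


def count_calls(k, l, memo):
    """Count evaluations of path on lists of lengths k and l when no edge exists.

    With memo=True each distinct subgoal is evaluated once and then looked up
    in a table; with memo=False every subgoal is re-evaluated on each call.
    """
    calls = 0

    def path(i, j):                  # i, j = remaining lengths of X and Y
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:         # step (1): an empty list fails
            return False
        # step (2) is skipped: edge(x, y) is assumed false everywhere
        return solve(i - 1, j) or solve(i, j - 1)    # step (3)

    # The recursive calls go through `solve`, so the table is consulted
    # at every level when memo is True.
    solve = lru_cache(maxsize=None)(path) if memo else path
    solve(k, l)
    return calls


print(count_calls(10, 10, memo=False), count_calls(10, 10, memo=True))
```

For two lists of ten elements the uncached version performs several hundred thousand evaluations while the cached version performs 120. The processor count quoted above follows from the same comparison: $2^{40}/40^2$ is roughly $7 \times 10^8$.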

A  final  word  about  this  example:  it  is  of  course  always  possible  to  create 
examples  that  make  any  approach  look  good.  We  feel,  however,  that  using 
memory  to  avoid  repeated  re-evaluation  of  subgoals  is  a  fundamental  technique 
to  speed  up  PROLOG.  Exponential  growth  cannot  simply  be  waved  away:  there 
are  many  natural  PROLOG  examples  that  lead  to  the  same  combinatorial  explo¬ 
sion.  A  PROLOG  machine  with  a  huge  memory  to  cache  millions  or  even  billions 
of  subgoals  would  be  extremely  powerful. 

2.3.  Table  Lookup 

A second critical use of memory to speed up PROLOG relies on the way the
PROLOG data base is searched. In order to reach a goal PROLOG searches its
rules for the first one that matches the current goal. While there are a number of
ways to speed up this search, the fastest one appears to use large amounts of
extra memory. The idea is simple: in addition to storing the rules, we store
indices (inverted lists) that make the search very fast. With the proper data
structures a constant time search independent of the size of the data base is
possible. Clearly, no number of parallel processors could outperform such an
implementation.
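
As an illustration of the indexing idea, the following Python sketch stores a set of two-argument facts both as a plain list and as a hash index keyed on the first argument. It is not the data structure used in our PROLOG experiments; the function names and the choice of indexing on the first argument only are assumptions made for the example.

```python
from collections import defaultdict


def build_index(facts):
    """Inverted list: map each first argument to the facts that begin with it."""
    index = defaultdict(list)
    for fact in facts:
        index[fact[0]].append(fact)
    return index


def lookup_scan(facts, key):
    """Unindexed search: examines every fact in the data base."""
    return [fact for fact in facts if fact[0] == key]


def lookup_indexed(index, key):
    """Indexed search: one hash probe, independent of the data base size."""
    return index.get(key, [])


facts = [("a", "b"), ("a", "c"), ("b", "c")]
index = build_index(facts)
print(lookup_indexed(index, "a"))
```

The index costs extra memory proportional to the number of facts, which is exactly the trade that a massive memory machine is intended to make cheap.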

We  have  performed  a  number  of  experiments  to  validate  this  claim.  Our 
experiments  so  far  have  consisted  of  comparing  the  standard  implementation  of 
PROLOG  with  ones  that  use  the  data  structures  described  above.  One  test  pro¬ 
gram  is  a  simple  PROLOG  program  that  computes  the  transitive  closure  of  a 
directed  graph: 


reach(X,Y) :- edge(X,Y).
reach(X,Y) :- edge(X,Z), reach(Z,Y).

(Again, edge(,) is a relation that is defined by other rules.) Table 1 contains the
actual results of experiments on a VAX 11/750. The speedups are dramatic: even
on modest sized graphs we get several orders of magnitude speedup. The reason
for these large speedups is that the parallel approach takes order $n^2/p$ time and
the memory intensive M3 approach takes only order n time. Since n reflects the
size of the data base, the potential for speedups on large data bases is immense.


Number of Edges   Number of Queries   PROLOG (secs)   M3-PROLOG (secs)

       78                 156               224.3             0.5
       60                 380               252.1             1.0
      100                 380              3525.7             1.2
      100                 870              3700.8             2.8
      120                1190             18375.8             4.8
      100                2450              6450.6             9.8
      165                2450                   ?            12.2

TABLE 1

Results of comparison of PROLOG and M3-based PROLOG implementation.
All times in VAX 11/750 seconds.
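
For readers who want to experiment with the reach(,) query, a memory-intensive evaluation can be sketched as follows. This Python sketch is an illustration only and is not the implementation measured in Table 1: it assumes the edge relation is given as a list of pairs, builds a successor index, and records every node already visited so that no subgoal is expanded twice (which also makes it terminate on cyclic graphs).

```python
from collections import defaultdict, deque


def reachable(edges, start):
    """Return the set of nodes Y for which reach(start, Y) holds."""
    succ = defaultdict(list)             # index: node -> list of direct successors
    for x, y in edges:
        succ[x].append(y)

    seen = set()                         # table of nodes already derived
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:          # expand each node at most once
                seen.add(nxt)
                queue.append(nxt)
    return seen


edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
print(sorted(reachable(edges, 1)))
```

Each edge is examined at most once per query, so the work per query is linear in the size of the edge relation.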


3.  M3  Performance  on  Certain  Database  Benchmarks 

In recent papers by Hawthorn and DeWitt [1], and by Hillyer, Shaw, and
Nigam [2], the performance of several database machines was compared. In this
note,  we  study  the  performance  of  an  M3  database  machine  on  the  same  queries 
under  comparable  assumptions. 

An M3 is not a “conventional” database machine, so we must clarify a few
points before starting our comparison. The basic premise of the M3 project is
that fast, semiconductor memory will soon be inexpensive enough so that many
important databases (e.g., dozens of gigabytes) will fit inside main memory.
When this occurs, it may be more cost effective to build a conventional machine
with a massive memory, rather than building a machine with parallel search elements
but with insufficient memory to hold the entire database. Thus, in our
comparisons we will assume that the database fits within the M3 memory, but it
does not fit in a machine where resources were invested in parallel processing
elements. The memory size / processor speed tradeoffs are discussed in detail in [3].

The  query  times  in  [1,2]  are  divided  into  query  processing  (or  compiling) 
time,  the  actual  database  search  time,  and  the  time  to  transmit  the  answer  back 
to  a  host  machine.  In  this  note  we  only  study  the  database  search  time  because 
the  other  times  will  be  roughly  the  same  in  M3  and  other  database  machines. 
Furthermore, we compare the M3 only to the NON-VON [2], the fastest of the
database  machines. 

The  M3  processor  speed  plays  a  very  important  role  in  the  evaluation.  To 
be  conservative,  we  assume  that  the  M3  has  a  1  MIPS  processor.  However,  at 
the  end  of  this  note  we  briefly  consider  the  effect  of  a  10  MIPS  processor,  noting 
that  this  value  is  still  very  reasonable. 

3.1.  Query  #1 

This query is a select over a relation with 1,110 tuples. Each tuple is 127
bytes long. The search key for the select is 12 bytes long. The answer consists
of 3 tuples, but only 21 bytes of each one are required for the answer. The
NON-VON search time for this query is between 0.0827 (best case, data on disk)
and 0.1007 (worst case, data also on disk) seconds. (Using the parameters of [2],
this is 0\10 + BOOM + DAVAC + DROT.)

For  M3,  the  search  time  depends  on  the  data  structures  available  for  the 
relation.  For  a  sequential  scan,  we  must  examine  each  of  the  tuples.  Assuming 
that  it  takes  10  machine  instructions  to  examine  a  tuple  (the  key  is  12  bytes  or  4 
words),  this  will  take  0.011  seconds  on  a  1  MIPS  machine.  If  a  binary  tree  exists 
for  the  relation  (and  one  of  the  premises  of  M3  is  that  there  will  be  enough 
memory  to  hold  auxiliary  structures  for  the  important  search  fields),  the  time  can 
be  reduced  considerably.  The  search  would  involve  going  down  the  tree  (11  levels 
maximum  and  10  instructions  to  examine  each  node),  and  extracting  pointers  to 
the  three  matching  records  (20  instructions),  for  a  total  of  130  microseconds.  If  a 
hash  table  exists,  we  would  simply  need  to  hash  on  the  key  and  extract  the 
pointers.  Assuming  10  instructions  per  pointer,  this  would  take  30  microseconds. 
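
The three estimates are simple instruction counts divided by the processor speed. The short Python sketch below repeats the arithmetic using only the figures given in this section; the variable names are ours.

```python
MIPS = 1_000_000                      # instructions per second on a 1 MIPS M3
NON_VON_BEST = 0.0827                 # best NON-VON search time in seconds (from [2])

scan_instr = 1110 * 10                # examine every tuple, 10 instructions each
tree_instr = 11 * 10 + 20             # 11 tree levels, then extract the pointers
hash_instr = 3 * 10                   # 10 instructions per matching pointer

scan_s = scan_instr / MIPS
tree_s = tree_instr / MIPS
hash_s = hash_instr / MIPS

for name, t in (("scan", scan_s), ("tree", tree_s), ("hash", hash_s)):
    print(f"{name}: {t * 1e6:.0f} microseconds, speedup {NON_VON_BEST / t:.0f}x")
```

The sequential scan comes out at about 11 milliseconds and the hash lookup at 30 microseconds, which is where the range of speedups from about 7 to about 2700 comes from.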

In summary, comparing against the best NON-VON times, the M3 could
provide anywhere from a 7 fold speedup (sequential search for M3) to a 2700 fold
speedup (hash table lookup for M3). It is interesting to note that if we assume
that NON-VON has all of the data in memory (which may not be fair since we
are giving NON-VON both a large memory and parallel search elements), it still
does not beat an M3 that uses hashing. In this case, both search times are comparable
(40 microseconds for NON-VON; 30 for M3).

3.2.  Query  #2 

The  second  query  is  a  select  of  one  relation,  followed  by  a  join  of  the  result 
with  a  second  relation.  The  first  relation  contains  282  (52-byte)  tuples.  The 
selection  yields  22  tuples.  The  second  relation  contains  11,436  (127-byte)  tuples. 
The  join  field  is  20  bytes  long,  and  422  tuples  are  produced  by  the  join.  The 
NON-VON  search  times  for  this  query  are  0.336  (best  case,  data  on  disk)  to 
0.4667  (worst  case,  data  also  on  disk)  seconds. 

The  M3  search  times  again  depend  on  the  data  structures  available.  If  none 
are  available,  we  must  first  scan  the  first  relation  (282  tuples  at  10  instructions 
per  tuple).  For  each  of  the  matching  22  tuples,  we  must  set  up  a  sequential  scan 
of  the  second  relation  (20  instructions,  say),  and  then  scan  (11,436  tuples  at  15 
instructions each). (Each check takes 15 and not 10 instructions as we had
assumed earlier because the join field is longer.) For each of the resulting 422
tuples,  suppose  we  perform  10  additional  instructions.  Adding  this  up  we  obtain 
about  3.8  million  instructions,  or  3.8  seconds  on  a  1  MIPS  machine. 

However,  if  we  construct  a  hash  table  to  aid  in  the  join  we  can  reduce  this 
time  considerably.  If  we  assume  it  takes  20  instructions  to  insert  the  key  of  each 
tuple  of  the  second  relation  into  a  hash  table,  then  228,720  instructions  will  build 
the  table.  To  check  if  each  of  the  22  keys  resulting  from  the  select  exist  in  the 
table  takes  only  22  times  say  20  instructions.  As  before,  we  include  2820  instruc¬ 
tions  to  do  the  initial  select,  plus  4220  instructions  to  process  the  resulting  tuples. 
This  gives  us  a  total  search  time  of  0.24  seconds. 

If  search  structures  already  exist  for  the  second  relation,  then  of  course  the 
time  can  be  further  reduced.  For  example,  if  a  binary  tree  exists  for  the  join 
field,  the  join  involves  looking  up  22  keys  (14  levels  of  the  tree  times  15  instruc¬ 
tions  at  each  node).  Adding  this  to  the  select  time  and  the  time  to  process  the 
422  results,  we  get  a  total  search  time  of  0.012  seconds. 

In  summary,  without  auxiliary  data  structures  M3  will  be  about  11  times 
slower  than  NON-VON  (best  time).  However,  if  M3  is  allowed  to  build  its  data 
structures,  it  can  be  1.4  times  faster  than  NON-VON.  If  the  structures  are 
already  in  place,  the  speedup  is  greater:  28  times. 
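
The instruction counts behind these three cases can be checked with a few lines of Python. The sketch uses only the figures stated above; the function names are ours.

```python
MIPS = 1_000_000                 # instructions per second on a 1 MIPS M3
NON_VON_BEST = 0.336             # best NON-VON search time in seconds (from [2])

SELECT = 282 * 10                # scan the first relation
OUTPUT = 422 * 10                # process the tuples produced by the join


def no_structures():
    """For each of the 22 selected tuples, set up and run a full scan."""
    return SELECT + 22 * (20 + 11_436 * 15) + OUTPUT


def build_hash_table():
    """Build a hash table on the second relation, then probe it 22 times."""
    return SELECT + 11_436 * 20 + 22 * 20 + OUTPUT


def existing_tree():
    """Look up 22 keys in an existing binary tree of 14 levels."""
    return SELECT + 22 * 14 * 15 + OUTPUT


for count in (no_structures(), build_hash_table(), existing_tree()):
    seconds = count / MIPS
    print(count, round(seconds, 4), round(NON_VON_BEST / seconds, 2))
```

The last column is the ratio of the NON-VON time to the M3 time, so a value below 1 means the M3 is slower.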

3.3.  Query  #3 

The last query examines a relation with 194 (256-byte) tuples. The values in
a given field (encumb, 4 bytes) are to be added for each group of tuples that
match in a second field (acct-fund, 8 bytes). There are 17 unique values of the
acct-fund fields. The NON-VON search times are 0.088 (best case, data on disk)
and 0.11 (worst case) seconds.

On an M3 we would always have to scan the entire relation, i.e., 194 tuples
times say 10 instructions per tuple. The results can be collected by building a
linked list, where each element contains the current sum for a given acct-fund
value. To add each new value, we must scan the list to find the proper record.
Since there will be at most 17 records, a scan will examine on the average 9 records,
at say 10 instructions each. Thus, each insertion takes 90 instructions, and this
must be multiplied by the 194 tuples that exist. The total time is then 0.019
seconds.
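
The accumulation scheme just described can be sketched as follows. This is an illustrative Python version in which an ordinary list of [acct_fund, running_sum] records stands in for the linked list; the sample tuples and the function name are hypothetical.

```python
def group_sums(tuples):
    """Sum the encumb field for each distinct acct_fund value.

    The partial sums are kept in a list that is scanned linearly for
    every input tuple, as in the linked-list scheme described above.
    """
    partial = []                          # each record is [acct_fund, running_sum]
    for acct_fund, encumb in tuples:
        for record in partial:            # scan for the matching record
            if record[0] == acct_fund:
                record[1] += encumb
                break
        else:                             # no record for this value yet
            partial.append([acct_fund, encumb])
    return partial


sample = [("A", 1), ("B", 2), ("A", 3)]   # hypothetical (acct_fund, encumb) pairs
print(group_sums(sample))
```

With at most 17 distinct acct-fund values the inner scan stays short, which is why a more elaborate structure brings only a modest improvement on this query.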


Since  there  are  so  few  records  in  the  linked  list  of  partial  sums,  changing  this 
data  structure  does  not  bring  large  improvements.  For  example,  with  a  B*tree  (5 
levels  maximum),  each  insertion  will  take  roughly  50  instructions,  for  a  total  time 
of  0.012  seconds. 

Comparing these numbers to the NON-VON times, we see that M3 is a factor
of 4 to 7 times faster on this query.

3.4.  Conclusions 

Our rough estimates clearly indicate that M3 can provide significant speedups
for the sample queries of [1,2]. To summarize the results, we present the following
table that gives the M3 speedup (i.e., the NON-VON search time divided
by the M3 search time) for the case where search structures and data are available
in M3 memory, and data is on disk in NON-VON. We also give the
speedup attainable if the M3 processor ran at 10 MIPS.


M3 Speedup    1 MIPS Processor    10 MIPS Processor

Query #1           2,700               27,000
Query #2              28                  280
Query #3               7                   70

As Hillyer, Shaw, and Nigam [2] state, “There are hazards in attempting to
deduce the relative merit of alternative architectures based on ‘paper and pencil’
analysis of performance on a small number of specific problems with specified
data.” We certainly agree with them: the results we have presented must be
treated with caution. However, we do feel that they illustrate that memory can
be an extremely useful resource and can provide impressive speedups, even when
the competition is a powerful database machine like NON-VON.